Kernel spectral clustering of large dimensional data

This article proposes a first analysis of kernel spectral clustering methods in the regime where the dimension $p$ of the data vectors to be clustered and their number $n$ grow large at the same rate. We demonstrate, under a $k$-class Gaussian mixture model, that the normalized Laplacian matrix associated with the kernel matrix asymptotically behaves similarly to a so-called spiked random matrix. Some of the isolated eigenvalue-eigenvector pairs in this model are shown to carry the clustering information upon a separability condition classical in spiked matrix models. We evaluate precisely the position of these eigenvalues and the content of the eigenvectors, which unveils important (sometimes quite disruptive) aspects of kernel spectral clustering from both theoretical and practical standpoints. Our results are then compared to the actual clustering performance of images from the MNIST database, revealing a close match between theory and practice.


Introduction
Kernel spectral clustering encompasses a variety of algorithms meant to group data in an unsupervised manner based on the eigenvectors of certain data-driven matrices. These methods are so widespread that they have become an essential ingredient of contemporary machine learning (see [31] and references therein). This being said, the theoretical foundations of kernel spectral clustering are not unified, as it can be obtained from several independent ideas, hence the multiplicity of algorithms meeting the same objective. Denote $x_1, \ldots, x_n \in \mathbb{R}^p$ the data vectors to be clustered in $k$ similarity classes and $\kappa: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}^+$ some data-affinity function (which by convention takes large values for resembling data pairs and small values for distinct vectors). We shall denote by $K$ the kernel matrix defined by $K_{ij} = \kappa(x_i, x_j)$. One of the original approaches [31] to data clustering consists in the relaxation of the following discrete problem
$$\mathrm{(RatioCut)} \qquad \operatorname*{argmin}_{\mathcal{C}_1 \cup \ldots \cup \mathcal{C}_k = \{x_1, \ldots, x_n\}} \sum_{a=1}^k \frac{1}{|\mathcal{C}_a|} \sum_{x_i \in \mathcal{C}_a} \sum_{x_j \notin \mathcal{C}_a} \kappa(x_i, x_j) \qquad (1.1)$$
where the minimum is taken over disjoint sets $\mathcal{C}_a$, $a = 1, \ldots, k$. Here the normalization by $|\mathcal{C}_a|$ (with $|\cdot|$ the set cardinality) ensures that classes have approximately balanced weights. Letting $M \in \mathbb{R}^{n \times k}$ be the matrix with $[M]_{ia} = |\mathcal{C}_a|^{-\frac12} \delta_{\{x_i \in \mathcal{C}_a\}}$ ($1 \le i \le n$, $1 \le a \le k$) and $\mathcal{M}$ the set of such matrices, this can be shown to be strictly equivalent to a trace optimization over $\mathcal{M}$. Alternatively, the intuition behind the popular Ng-Jordan-Weiss algorithm [25] starts off by considering the so-called normalized Laplacian $I_n - D^{-\frac12} K D^{-\frac12}$ of $K$ and by noticing that, if ideally $\kappa(x_i, x_j) = 0$ when $x_i$ and $x_j$ belong to distinct classes (and $\kappa(x_i, x_j) > 0$ otherwise), then the vectors $D^{\frac12} 1_{\mathcal{C}_a} \in \mathbb{R}^n$, where $1_{\mathcal{C}_a}$ is the indicator vector of the class $\mathcal{C}_a$ (composed of ones at the indices of $\mathcal{C}_a$ and zeros otherwise), are eigenvectors of the normalized Laplacian for the eigenvalue zero. In practice, $\kappa(x_i, x_j)$ is merely expected to take small values for vectors of distinct classes, so that the algorithm consisting in retrieving classes from the eigenvectors associated with the smallest eigenvalues of the (nonnegative definite) normalized Laplacian matrix will approximately perform the desired task.
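To fix ideas, here is a minimal numpy sketch (ours, not from the article) of the Ng-Jordan-Weiss procedure just described; the Gaussian affinity default and the rudimentary k-means loop are illustrative assumptions.

```python
import numpy as np

def njw_clustering(X, k, kernel=lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2)):
    """Cluster the rows of X into k classes from the eigenvectors attached
    to the k smallest eigenvalues of I_n - D^{-1/2} K D^{-1/2}."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    d = K.sum(axis=1)                               # degrees of K
    Lap = np.eye(n) - K / np.sqrt(np.outer(d, d))   # normalized Laplacian
    _, eigvecs = np.linalg.eigh(Lap)                # eigenvalues in ascending order
    V = eigvecs[:, :k]                              # k smallest eigenvectors
    rng = np.random.default_rng(0)                  # rudimentary k-means on rows of V
    centers = V[rng.choice(n, k, replace=False)]
    for _ in range(100):
        labels = np.argmin(((V[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([V[labels == a].mean(axis=0) if np.any(labels == a)
                            else centers[a] for a in range(k)])
    return labels
```

In the ideal block-kernel case discussed above, the rows of V are constant within classes, so that any sensible clustering of these rows recovers the partition.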
Theoretically speaking, assuming $x_1, \ldots, x_n \in \mathbb{R}^p$ independently distributed according to a $k$-class mixture probability measure, it is proved in [32] that the various spectral clustering algorithms are consistent as $n \to \infty$, in the sense that they return the statistical clustering allowed by the kernel function $\kappa$. Despite this important result, it nonetheless remains unclear which clustering performance can be achieved for finite $n, p$ pairs. This question is all the more pressing since spectral clustering methods are increasingly used in settings where $p$ is of a size similar to, if not much larger than, $n$. In this article, we aim at providing a first understanding of the behavior of kernel spectral clustering as both $n$ and $p$ are large but of similar order of magnitude. To this end, we shall leverage the recent result from [17] on the limiting spectrum of kernel random matrices for independent and identically distributed zero mean (essentially Gaussian) vectors. Our approach is to generalize the latter to a $k$-class Gaussian mixture model for the normalized Laplacian of $K$, rather than for the kernel matrix $K$ itself.
Our focus, for reasons discussed below, is precisely on the following version of the normalized Laplacian matrix, which we shall from now on refer to, with a slight language abuse, as the Laplacian matrix of $K$. The kernel function $\kappa$ will be such that $\kappa(x, y) = f(\|x - y\|^2 / p)$ for some sufficiently smooth $f$ independent of $n, p$. For $x_i$ in class $\mathcal{C}_a$, we assume that $x_i \sim \mathcal{N}(\mu_a, C_a)$ with $(\mu_a, C_a) \in \{(\mu_1, C_1), \ldots, (\mu_k, C_k)\}$ ($k$ shall remain fixed while $p, n \to \infty$) such that, for each $(a, b)$, $\|\mu_a - \mu_b\| = O(1)$ while $\|C_a\| = O(1)$ (here, and throughout the article, $\|\cdot\|$ stands for the operator norm) and $\operatorname{tr}(C_a - C_b) = O(\sqrt{p})$. This setting can be considered as a critical growth rate regime in the sense that, supposing $\operatorname{tr} C_a$ and $\operatorname{tr} C_a^2$ to be of order $O(p)$ (which is natural if $\|C_a\| = O(1)$) and $\|\mu_a - \mu_b\| = O(1)$, the norms of the observations in each class $\mathcal{C}_a$ fluctuate at rate $\sqrt{p}$ around $\operatorname{tr} C_a$, so that clustering ought to be possible so long as $\operatorname{tr}(C_a - C_b)$ is at least of order $\sqrt{p}$. The technical contribution of this work is to provide a thorough analysis of the eigenvalues and eigenvectors of the matrix $L$ in the aforementioned regime. In a nutshell, we shall demonstrate that there exist critical values for the (inter-cluster differences between) means $\mu_i$ and covariances $C_i$ beyond which some (relevant) eigenvalues of $L$ tend to isolate from the majority of the eigenvalues, thus inducing a so-called spiked model for $L$. When this occurs, the eigenvectors associated with these isolated eigenvalues contain information about class clustering. Our objective is to precisely describe the structure of the individual eigenvectors as well as to evaluate correlation coefficients among them, keeping in mind that the ultimate interest lies in harnessing spectral clustering methods. The outcomes of our study shall provide new practical insights and methods to appropriately select the kernel function $f$. Before delving concretely into our main results, some of which may seem quite cryptic at the onset, we introduce below a motivating example and some visual results of our work.
The proofs of some of the technical mathematical results are deferred to our companion paper [3].
Notations: The norm $\|\cdot\|$ stands for the Euclidean norm for vectors and the associated operator norm for matrices. The notation $\mathcal{N}(\mu, C)$ is the multivariate Gaussian distribution with mean $\mu$ and covariance $C$. The vector $1_m \in \mathbb{R}^m$ stands for the vector filled with ones. The Dirac delta measure is denoted $\delta$. The operator $\mathcal{D}(v) = \mathcal{D}\{v_a\}_{a=1}^k$ is the diagonal matrix having $v_1, \ldots, v_k$ (scalars or vectors) as its ordered diagonal elements. The distance from a point $x \in \mathbb{R}$ to a set $\mathcal{X} \subset \mathbb{R}$ is denoted $\operatorname{dist}(x, \mathcal{X})$. The support of a measure $\nu$ is denoted $\operatorname{supp}(\nu)$. We shall often denote $\{v_a\}_{a=1}^k$ a column vector with $a$-th entry (or block entry) $v_a$ (which may be a vector itself), while $\{V_{ab}\}_{a,b=1}^k$ denotes a square matrix with entry (or block entry) $(a, b)$ given by $V_{ab}$ (which may be a matrix itself).

Motivation and statement of main results
Let us start by illustratively motivating our work. Figure 2 displays in red the eigenvectors associated with the four largest eigenvalues of $L$ (as defined in (1.2)) for $x_1, \ldots, x_n$ a set of (preprocessed) vectorized images sampled from the popular MNIST database of handwritten digits [23]. The vectorized images are of size $p = 784$ (since images are of size $28 \times 28$) and we take $n = 192$ samples, with the first 64 $x_i$'s being images of zeros, the next 64 images of ones, and the last 64 images of twos. An example of these is displayed in Figure 1 (the vector values follow a grayscale from zero for black to one for white). The kernel function $f$ (defining the kernel matrix $K$ through (1.3)) is taken to be the standard Gaussian kernel $f(x) = \exp(-x/2)$.
As recalled earlier, performing spectral clustering on $x_1, \ldots, x_n$ here consists in extracting the leading eigenvectors from $L$ and applying a procedure that identifies the three classes and properly maps every datum to its own class. As can be observed from Figure 2, the classes are visually clearly discriminated, although some data stick out which, no matter how thorough the aforementioned procedure, are bound to be misclassified. A particularly common algorithm to cluster the data consists in extracting the entry $i$ from each of the, say $l$, dominant eigenvectors to form a vector $y_i \in \mathbb{R}^l$. Clustering then resorts to applying some standard unsupervised classification technique, such as the popular k-means method [9, Chapter 9], to the small dimensional vectors $y_1, \ldots, y_n$. Figure 3 depicts the vectors $y_i$ for $l = 2$, successively accounting only for the leading eigenvectors one and two (top figure) or two and three (bottom figure) of $L$. The crosses, each corresponding to a specific $y_i \in \mathbb{R}^2$ vector, are colored according to their genuine class. As one can observe, for each class there exists a non-trivial correlation between the entries of $y_i$, and thus between the eigenvectors of $L$.
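A minimal numpy sketch of this embedding step (ours; the Gaussian kernel default is an illustrative assumption) reads:

```python
import numpy as np

def spectral_embedding(X, l, f=lambda t: np.exp(-t / 2)):
    """Build L = n D^{-1/2} K D^{-1/2} with K_ij = f(||x_i - x_j||^2 / p)
    and return the n x l matrix whose i-th row is the vector y_i gathering
    entry i of each of the l dominant eigenvectors of L."""
    n, p = X.shape
    sq = ((X[:, None] - X[None]) ** 2).sum(-1)   # pairwise squared distances
    K = f(sq / p)                                # kernel matrix
    d = K.sum(axis=1)                            # degrees
    L = n * K / np.sqrt(np.outer(d, d))          # n D^{-1/2} K D^{-1/2}
    _, eigvecs = np.linalg.eigh(L)               # eigenvalues in ascending order
    return eigvecs[:, -l:][:, ::-1]              # l dominant eigenvectors

# The rows y_1, ..., y_n may then be fed to any k-means routine.
```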
In order to anticipate the performance of clustering methods, and to be capable of improving them, a fundamental first step is to understand the behavior of the eigenvectors of Figure 2 along with their joint correlation, as exemplified in Figure 3. (The full MNIST database is preprocessed by discarding from all images the empirical mean and by then scaling the resulting vectorized images by $p$ over the average squared norm of all vectorized images; this preprocessing ensures an appropriate match with the base Assumption 1 below.) The present work intends to lay the theoretical grounds for such clustering performance understanding and improvement. Namely, we shall investigate the existence and position of isolated eigenvalues of $L$ and shall show that some of them (not always all of them) carry information about the class structure of the problem. Then, since, by a clear invariance property of the model under consideration, each of the dominant eigenvectors of $L$ can be divided class-wise into $k$ chunks, each of which is essentially composed of independent realizations of a random variable with given mean and variance, we shall identify these means and variances. Finally, since the eigenvectors are correlated, we shall evaluate the class-wise correlation coefficients.
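A sketch of this preprocessing, under one consistent reading of the normalization just described, is:

```python
import numpy as np

def preprocess(X):
    """Center the vectorized images by the empirical mean image, then
    rescale so that the average squared norm equals p (Assumption 1)."""
    X = X - X.mean(axis=0)                    # discard the empirical mean
    p = X.shape[1]
    avg_sq_norm = (X ** 2).sum(axis=1).mean()
    return X * np.sqrt(p / avg_sq_norm)       # average ||x_i||^2 becomes p
```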
As a first glimpse of the practical interest of our results, Figure 2 displays in blue lines the theoretical means and standard deviations for each class-wise chunk of the eigenvectors, obtained from the results of this article. That is, the means and standard deviations that one would obtain if the data were genuinely Gaussian (which the MNIST images obviously are not). Also, Figure 3 proposes in blue ellipses the theoretical one- and two-standard deviations of the joint eigenvector entries, again as if the data were Gaussian. It is quite interesting to see that, in spite of their evident non-Gaussianity, the theoretical findings visually conform to the data behavior. We are thus optimistic that the findings of this work, although restricted to Gaussian assumptions, can be applied to a large set of problems beyond strongly structured ones.
We summarize below our main theoretical contributions and their practical aftermaths, all detailed more thoroughly in the subsequent sections. From a technical standpoint, our main results may be summarized as follows: (1) as $n, p \to \infty$ while $n/p = O(1)$, $\|L' - \hat{L}'\| \to 0$ (in operator norm) almost surely, where $L'$ is a slight modification of $L$ and $\hat{L}'$ is a matrix which is an instance of the so-called spiked random matrix models, as introduced in [7,8,4,5,19] (but closer to the model studied independently in [13]); that is, the spectrum of $\hat{L}'$ is essentially composed of (one or several) bulks of eigenvalues and finitely many isolated ones. This result is the mandatory ground step that allows for the theoretical understanding of the eigenstructure of $L$; (2) the matrix $\hat{L}'$ only depends on the successive derivatives $f^{(\ell)}$, $\ell = 0, 1, 2$, of $f$ evaluated at a specific value $\tau$ (which, in passing, can be empirically estimated). Any two functions with the same first derivatives at $\tau$ thus provably exhibit the same asymptotic clustering performance. Besides, different choices of the $f^{(\ell)}(\tau)$ set specific emphasis on the importance of the means $\mu_a$ or covariances $C_a$ in the eigenstructure of $\hat{L}'$; (3) as is standard in spiked models, there exists a phase transition phenomenon by which, the more distinct the classes, the more eigenvalues tend to isolate from the main eigenvalue bulk of $\hat{L}'$ and the more information is contained within the eigenvectors associated with those eigenvalues. This statement is precisely accounted for by exhibiting conditions for the separability of the isolated eigenvalues from the main bulk, by exactly locating these eigenvalues, and by retrieving the asymptotic values of the class-wise means and variances of the isolated eigenvectors; (4) the eigenvectors associated with the isolated eigenvalues are correlated to one another, and we precisely exhibit the asymptotic correlation coefficients.
Aside from these main expected results are some more subtle and somewhat unexpected outcomes: (5) the eigenvectors associated with some of the non-extreme isolated eigenvalues of $L$ may contain information about the classes, and thus clustering may be performed based not only on extreme eigenvectors; (6) on the contrary, some of the eigenvectors associated with isolated eigenvalues, even the largest ones, may be purely noisy; (7) in some specific scenarios, the theoretical number of informative isolated eigenvalues cannot exceed two altogether, while in others as many as $k - 1$ can be found in-between each pair of eigenvalue bulks of $L'$; (8) in some other scenarios, two eigenvectors may be essentially the same, so that some eigenvectors may not always provide much information diversity.
From a practical standpoint, the aforementioned technical results, along with the observed adequacy between theory and practice, have the following key entailments: (A) as opposed to classical kernel spectral clustering insights for small dimensional datasets, high dimensional data tend to be "always far from one another", to the point that $\|x_i - x_j\|$ for intra-class data $x_i$ and $x_j$ may systematically be larger than for inter-class data. This disrupts many aspects of kernel spectral clustering, starting with the newfound interest in non-decreasing kernel functions $f$; (B) the interplay between the triplet $(f(\tau), f'(\tau), f''(\tau))$ and the class-wise means and covariances opens up a new road for kernel investigations; in particular, although counter-intuitive, choosing $f$ non-monotonic may be beneficial for some datasets. In a work subsequent to the present article [12], we show that choosing $f'(\tau) = 0$ allows for very efficient subspace clustering of zero mean data, where the traditional Gaussian kernel $f(x) = \exp(-x/2)$ completely fails (the motivation for [12] was spurred by the important Remark 12 below); (C) more specifically, in problems where clustering ought to group data upon specific statistical properties (e.g., upon the data covariance, irrespective of the statistical means), appropriate choices of the kernel $f$ can be made that purposely discard specific statistical information; (D) generally speaking, since only the values $f(\tau)$, $f'(\tau)$, and $f''(\tau)$ play a significant role in the asymptotic regime, the search for an optimal kernel reduces to a three-dimensional line search. One may, for instance, perform spectral clustering on a given dataset over a finite mesh of values of $(f(\tau), f'(\tau), f''(\tau)) \in \mathbb{R}^3$ and select as the "winning output" the one achieving the minimum RatioCut value (as per Equation (1.1)). This method dramatically reduces the search space of optimal kernels; (E) the results of the study of the eigenvector contents, along with point (B) above, allow for a theoretical evaluation of the optimally expectable performance of kernel spectral clustering for large dimensional Gaussian mixtures (and then likely for any practical large dimensional dataset). As such, upon the existence of a parallel set of labelled data, one may prefigure the optimal quality of kernel clustering on similar datasets (e.g., datasets anticipated to share similar statistical structures).
We now turn to the detailed introduction of our model and to some necessary preliminary notions of random matrix theory.

Preliminaries
Let $x_1, \ldots, x_n \in \mathbb{R}^p$ be independent vectors belonging to $k$ distribution classes $\mathcal{C}_1, \ldots, \mathcal{C}_k$, with $x_{n_1 + \ldots + n_{a-1} + 1}, \ldots, x_{n_1 + \ldots + n_a} \in \mathcal{C}_a$ for each $a \in \{1, \ldots, k\}$ (so that each class $\mathcal{C}_a$ has cardinality $n_a$), where $n_0 = 0$ and $n_1 + \ldots + n_k = n$. We assume that $x_i \in \mathcal{C}_a$ is given by
$$x_i = \mu_a + \sqrt{p}\, w_i \qquad (3.1)$$
for some $\mu_a \in \mathbb{R}^p$ and $w_i \sim \mathcal{N}(0, p^{-1} C_a)$, with $C_a \in \mathbb{R}^{p \times p}$ nonnegative definite (the factors $\sqrt{p}$ and $p^{-1}$ will lighten the notations below). We shall consider the large dimensional regime where both $n$ and $p$ are simultaneously large, with the following growth rate assumption.
As discussed in the introduction, the growth rates above were chosen in such a way that the achieved clustering performance is non-trivial, in the sense that: (i) the proportion of misclassified data remains non-vanishing as $n \to \infty$, and (ii) there exist minimal values of $\|\mu_a^\circ\|$, $\frac{1}{\sqrt{p}} \operatorname{tr} C_a^\circ$, and $\frac{1}{p} \operatorname{tr} C_a^\circ C_b^\circ$ below which no isolated eigenvector can be used to perform efficient spectral clustering.
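For simulations, a sampler of the model (3.1) is immediate (our sketch; means, covariances, and class sizes are the caller's choice):

```python
import numpy as np

def sample_mixture(mus, Cs, ns, seed=0):
    """Draw x_i = mu_a + sqrt(p) w_i with w_i ~ N(0, C_a / p), i.e.
    x_i ~ N(mu_a, C_a), stacking n_a samples for each class a."""
    rng = np.random.default_rng(seed)
    X, labels = [], []
    for a, (mu, C, na) in enumerate(zip(mus, Cs, ns)):
        X.append(rng.multivariate_normal(mu, C, size=na))
        labels += [a] * na
    return np.vstack(X), np.array(labels)
```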
For further use, we now define
$$\tau \triangleq \frac{2}{p} \operatorname{tr} C^\circ, \qquad C^\circ \triangleq \sum_{a=1}^k \frac{n_a}{n} C_a.$$
This quantity is central to our analysis, as it is easily shown that, under Assumption 1,
$$\max_{1 \le i \ne j \le n} \left| \frac{1}{p} \|x_i - x_j\|^2 - \tau \right| \to 0$$
almost surely. The value $\tau$, which depends implicitly on $n$, is bounded but need not converge as $p \to \infty$. Let $f: \mathbb{R}^+ \to \mathbb{R}^+$ be a function satisfying the following assumptions.

Assumption 2 (Kernel function)
The function $f$ is three-times continuously differentiable in a neighborhood of the values taken by $\tau$. Moreover, $\liminf_n f(\tau) > 0$.
Define now $K$ to be the kernel matrix
$$K \triangleq \left\{ f\left( \frac{1}{p} \|x_i - x_j\|^2 \right) \right\}_{i,j=1}^n. \qquad (3.2)$$
From (3.2), it appears that, while the diagonal elements of $K$ are all equal to $f(0)$, the off-diagonal entries jointly converge toward $f(\tau)$. This means that, up to $(f(\tau) - f(0)) I_n$, $K$ is essentially a rank-one matrix.
Remark 1 (On the Structure of $K$). The observation above has important consequences for the traditional vision of kernel spectral clustering. Indeed, while in the low-dimensional regime (small $p$) it is classically assumed that intra-class data can be linked through a chain of short distances $\|x_i - x_j\|$, for large $p$ all the $x_i$ tend to be far apart. The statistical differences between data, which shall then allow for clustering, only appear in the second order terms of the expansion of $K_{ij}$, and these need not be ordered in a decreasing manner as $x_i$ and $x_j$ belong to "more distant classes". This immediately invalidates the classical requirement that $f$ be a decreasing function, thereby departing from elementary considerations in traditional spectral clustering.
As spectral clustering is based on Laplacian matrices rather than on $K$ itself, we shall focus here on the Laplacian matrix
$$L \triangleq n D^{-\frac12} K D^{-\frac12}, \qquad (1.2)$$
where $D \triangleq \mathcal{D}(K 1_n)$ is often referred to as the matrix of degrees of $K$. Aside from the arguments laid out in the introduction, the choice of studying the matrix $L$ also follows from a better stability of clustering algorithms based on $L$ versus $K$ and $D - K$, which we observed in various simulations. Under our growth rate assumptions, the matrix $L$ shall be seen to essentially be a rank-one matrix, which is rather simple to deal with since, unlike for $K$, its dominant eigenvector is precisely known to be $D^{\frac12} 1_n$; it shall be shown that the matrix $L'$, obtained by projecting $L$ away from this eigenvector, has bounded operator norm almost surely as $n \to \infty$. Indeed, note here that $L$ and $L'$ have the same eigenvalues and eigenvectors, but for the eigenvalue-eigenvector pair $(n, D^{\frac12} 1_n)$ of $L$ turned into $(0, D^{\frac12} 1_n)$ for $L'$. Under the aforementioned assumptions, the matrix $L'$ will subsequently be shown to have all its eigenvalues of order $O(1)$.
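In code, one consistent reading of this construction (our sketch) is:

```python
import numpy as np

def laplacian_and_projection(K):
    """From a kernel matrix K, form L = n D^{-1/2} K D^{-1/2} and the
    projected matrix L' in which the known eigenpair (n, D^{1/2} 1_n)
    is turned into (0, D^{1/2} 1_n), all other eigenpairs being kept."""
    n = K.shape[0]
    d = K.sum(axis=1)                 # degrees, D = diag(d)
    L = n * K / np.sqrt(np.outer(d, d))
    v = np.sqrt(d / d.sum())          # unit-norm version of D^{1/2} 1_n
    Lp = L - n * np.outer(v, v)       # L' = L - n v v^T
    return L, Lp
```

One checks directly that `L @ v` equals `n * v`, so that the subtraction indeed sends this eigenvalue to zero while leaving the remaining spectrum untouched.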
Our first intermediary result shows that there exists a matrix $\hat{L}'$ such that $\|L' - \hat{L}'\| \to 0$ almost surely, where $\hat{L}'$ follows an analytically tractable random matrix model. Before going into the result, a few notations need to be introduced. In the remainder of the article, we shall use deterministic element notations, among which $J = [j_1, \ldots, j_k]$, where $j_a \in \mathbb{R}^n$ is the canonical vector of class $\mathcal{C}_a$, defined such that $(j_a)_i = \delta_{x_i \in \mathcal{C}_a}$, along with random element notations (recall here that $w_i$ is defined in (3.1)). The case $f'(\tau) = 0$ is obtained through extension by continuity ($f'(\tau) B$ being well defined as $f'(\tau) \to 0$).
From Theorem 1, it follows that the eigenvalues of $L'$ and $\hat{L}'$ converge to one another (we have as an immediate corollary that $\max_i |\lambda_i(L') - \lambda_i(\hat{L}')| \xrightarrow{a.s.} 0$; see e.g., [21, Theorem 4.3.7]), so that the determination of the isolated eigenvalues in the spectrum of $L'$ (or $L$) can be studied through the equivalent problem for $\hat{L}'$. More importantly, Theorem 1 implies that, for every isolated eigenvector $u$ of $L'$ and its associated eigenvector $\hat{u}$ of $\hat{L}'$, $\|u - \hat{u}\| \xrightarrow{a.s.} 0$. Thus, the spectral clustering performance based on the observable $L'$ (or $L$) may be asymptotically analyzed through that of $\hat{L}'$.
A few important remarks concerning Theorem 1 are in order before proceeding. From a mathematical standpoint, observe that, up to a scaled identity matrix and a constant scale factor, if $f'(\tau) \ne 0$, $\hat{L}'$ is a random matrix of the so-called spiked model family [2], in that it equals the sum of a somewhat standard random matrix model $P W^\top W P$ and a small rank (here up to $2k+1$) matrix $U B U^\top$. Nonetheless, it differs from classically studied spiked models in several aspects: (i) $U$ is not independent of $W$, a technical issue that can fairly easily be handled, and (ii) $P W^\top W P$ itself constitutes a spiked model, as $P$ is a low rank perturbation of the identity matrix. As such, as $n \to \infty$, the eigenvalues of $\hat{L}'$ are expected to be asymptotically the same as those of $P W^\top W P$ (which mainly gather in bulks), possibly up to finitely many of them which are allowed to wander away from the main eigenvalue bulks. As per classical spiked model results from random matrix theory, it is then naturally expected that, if some of the (finitely many) eigenvalues of $U B U^\top$ are sufficiently large, these shall induce isolated eigenvalues in the spectrum of $\hat{L}'$, the eigenvectors of which align to some extent with those of $U B U^\top$, a fully deterministic matrix (of maximum rank $k+1$ in its informative part) whose eigenvalue-eigenvector pairs are immediately accessible.
From a spectral clustering perspective, observe that $U$ is importantly constituted by the vectors $j_a$, $1 \le a \le k$, while $B$ contains the information about the inter-class mean deviations through $M$, and about the inter-class covariance deviations through $t$ and $T$. As such, some of the aforementioned isolated eigenvectors are expected to align with the canonical class basis $J$, and we already intuit that this will be true all the more so as the matrices $M$, $t$, $T$ have sufficient "energy" (i.e., are sufficiently away from zero). Theorem 1 thus already prefigures the behavior of the spectral clustering methods thoroughly detailed in Section 4.
A more detailed application-oriented analysis now sheds light on the behavior of the kernel function $f$. Note that, if $f'(\tau) \to 0$, $\hat{L}'$ becomes essentially deterministic, as $f'(\tau) P W^\top W P \to 0$, which has a positive effect on the alignment between the eigenvectors of $L$ and $J$. However, when $f'(\tau) \to 0$, $M$ vanishes from the expression of $\hat{L}'$, thus not allowing spectral clustering to rely on differences in means. Similarly, if $f''(\tau) \to 0$, then $T$ vanishes, and thus differences in "shape" between the covariance matrices cannot be discriminated upon. Finally, if $\frac{5 f'(\tau)}{8 f(\tau)} - \frac{f''(\tau)}{2 f'(\tau)} \to 0$, then differences in covariance traces are seemingly not exploitable.
This observation leads to the following key remark on the optimal choice of a kernel.

Remark 2 (Optimizing Kernel Spectral Clustering). Since $\hat{L}'$ only depends on the triplet $(f(\tau), f'(\tau), f''(\tau))$ combined with the matrices $M$, $t$, and $T$, it is clear that the asymptotic performance of kernel spectral clustering reduces to a function of these parameters. In practice, as we shall see that $\tau$ can be consistently estimated from the whole dataset, spectral clustering optimization can be performed by scanning a grid of values of $(f(\tau), f'(\tau), f''(\tau)) \in \mathbb{R}^3$, rather than scanning over the set of real functions. Besides, if the objective is to specifically cluster data upon their means, covariance traces, or covariance shape information, specific choices of $f$ (such as second order polynomials) meeting the previously discussed conditions to eliminate the effects of $M$, $t$, or $T$ should be made.
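A minimal way of realizing such second order polynomial kernels (our sketch; the parameter values in the example are illustrative) is:

```python
import numpy as np

def poly_kernel(f0, f1, f2, tau):
    """Second order polynomial kernel f with prescribed f(tau) = f0,
    f'(tau) = f1, f''(tau) = f2: scanning (f0, f1, f2) over a grid then
    amounts to scanning all kernels that matter asymptotically."""
    return lambda t: f0 + f1 * (t - tau) + 0.5 * f2 * (t - tau) ** 2

# Example: a kernel with f'(tau) = 0, emphasizing covariance information;
# tau would be consistently estimated from the data (see the section on
# practical considerations).
f = poly_kernel(f0=1.0, f1=0.0, f2=1.0, tau=2.0)
```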
An illustrative application of Theorem 1 is proposed in Figure 4, where a three-class example with Gaussian kernel function is considered. Note the extremely accurate approximation of the spectrum of $L'$ by that of $\hat{L}'$ in this example. Anticipating slightly our coming results, note here that, aside from the eigenvalue zero, two isolated (spiked) eigenvalues are observed, which shall presently be related to the eigenvalues of the small rank matrix $U B U^\top$.
Before introducing our technical approach, a few further random matrix notions are needed.

Lemma 1 (Deterministic equivalent). Let Assumptions 1 and 2 hold, and let $Q_z \triangleq (P W^\top W P - z I_n)^{-1}$ for $z$ at macroscopic distance from the eigenvalues of $P W^\top W P$. Then, as $n \to \infty$, $\frac{1}{n} \operatorname{tr} D_n (Q_z - \bar{Q}_z) \xrightarrow{a.s.} 0$ and $d_{1,n}^\top (Q_z - \bar{Q}_z) d_{2,n} \xrightarrow{a.s.} 0$ for all deterministic Hermitian matrices $D_n$ and deterministic vectors $d_{i,n}$ of bounded norms, where the deterministic equivalent $\bar{Q}_z$ is built upon $(g_1, \ldots, g_k)$, the unique vector of Stieltjes transforms solving, for all such $z$, the implicit equations detailed in [3].
Denoting $\mathcal{S}_p$ the support of the limiting spectral measure associated with $g_1, \ldots, g_k$, Lemma 1 is merely an extension of Propositions 3 and 4 of the companion paper [3], where more details are given about the functions $g_a$. It is also an extension of standard random matrix results such as [29,30,14] to more involved structures (studied earlier in e.g., [11,33]). This result mainly states that the bilinear forms as well as the linear statistics of the resolvent of $P W^\top W P$ asymptotically tend to be deterministic in a controlled manner. Note in passing that, as shown in [3], $g_1(z), \ldots, g_k(z)$ can be evaluated by a provably converging fixed-point algorithm, for all $z \in \mathbb{C} \setminus \mathcal{S}_p$.
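The implicit equations themselves are those of [3] and are not reproduced here; as an illustration of the fixed-point principle, the sketch below treats the special case where all $g_a$ coincide and $\nu = \delta_\beta$ (anticipating equation (5.1) below), in which $g(z)$ solves $g = (-z + c_0 \beta / (1 + \beta g))^{-1}$; the parameter values are illustrative assumptions.

```python
import numpy as np

def g_fixed_point(z, beta=1.0, c0=2.0, n_iter=5000):
    """Stieltjes transform g(z) in the special case C_a = beta * I_p
    (nu = delta_beta), via g <- 1 / (-z + c0 * beta / (1 + beta * g)).
    Take z slightly off the real axis; convergence slows as Im(z) shrinks."""
    g = -1.0 / z
    for _ in range(n_iter):
        g = 1.0 / (-z + c0 * beta / (1.0 + beta * g))
    return g

# The limiting spectral density at x is recovered as Im g(x + i*eps) / pi.
xs = np.linspace(0.05, 8.0, 160)
density = np.array([g_fixed_point(x + 1e-3j).imag / np.pi for x in xs])
```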

Remark 3 (On the sets $\mathcal{S}_p$ and $\mathcal{G}_p$). In particular, if $\mathcal{S}_p$ is constituted of a single connected component (as shall often be the case in practice when $c_0 > 1$), then $\mathcal{G}_p = \{0\}$. Besides, if $g_1 = \ldots = g_k = g$ (a scenario that shall be thoroughly studied in the course of this article), the results of [30] provide an explicit characterization of the associated quantities. With those notations and remarks at hand, we are now ready to introduce our main results.

Main results
Before delving into the investigation of the eigenvalues and eigenvectors of $L$, recall from Theorem 1 that the behavior of $L$ is strikingly different depending on whether $f'(\tau)$ stays away from zero or instead $f'(\tau) \to 0$. We shall thus systematically study both cases independently. In practice, if $f'(\tau)$ only has some limit points at zero (so that it neither stays away from zero nor converges to zero), then the following study remains valid up to extracting subsequences in $p$.

Isolated eigenvalues
Assume first that $f'(\tau)$ is away from zero, i.e., $\liminf_p |f'(\tau)| > 0$. In order to study the isolated eigenvalues and associated eigenvectors of the model, we follow standard random matrix approaches as developed in e.g., [8,19]. That is, to determine the isolated eigenvalues, we shall solve, for $\rho$ at macroscopic distance from the main spectrum,
$$\det\left( P W^\top W P + U B U^\top - \rho I_n \right) = 0.$$
Factoring out $P W^\top W P - \rho I_n$ and using Sylvester's identity, the above equation is equivalent to
$$\det\left( B U^\top Q_\rho U + I_{2k+1} \right) = 0, \qquad Q_\rho \triangleq (P W^\top W P - \rho I_n)^{-1}.$$
By Lemma 1 (and some arguments to handle the dependence between $U$ and $Q_\rho$), $U^\top Q_\rho U$ tends to be deterministic in the large $n$ limit, and thus, using a perturbation approach along with the argument principle, we find that the isolated eigenvalues of $P W^\top W P + U B U^\top$ tend to be the values of $\rho$ for which $B U^\top Q_\rho U + I_{2k+1}$ has a zero eigenvalue, the multiplicity of $\rho$ being asymptotically the same as that of the aforementioned zero eigenvalue.
Once all calculations are carried out, we obtain the following first main result.
Theorem 2 (Isolated eigenvalues, $f'(\tau)$ away from zero). Let Assumptions 1 and 2 hold and define the $k \times k$ matrix $G_\rho$ (expressed in terms of $h(\tau, \rho)$, $\Gamma_\rho$, and $D_{\tau,\rho}$ introduced below). Let $\rho$ be at macroscopic distance from $\mathcal{S}_p$ and be such that $G_\rho$ has a zero eigenvalue of multiplicity $m_\rho$. Then there exist eigenvalues $\lambda_j^p \ge \cdots \ge \lambda_{j+m_\rho-1}^p$ of $L'$ ($j$ may depend on $p$) such that $\max_{0 \le i < m_\rho} |\lambda_{j+i}^p - \rho| \xrightarrow{a.s.} 0$. As it shall turn out, the isolated eigenvalues identified in Theorem 2 are the only ones of practical interest for spectral clustering, as they are strongly related to $J$. However, some other isolated eigenvalues may be found, which we discuss here for completeness (and thoroughly in the proof section).
Remark 4 (Additional isolated eigenvalues). Further isolated eigenvalues $\rho_+$ of $L'$ may be found among the solutions of $h(\tau, \rho_+) = 0$, where $m_\rho \ge 1$ is then the multiplicity of zero as an eigenvalue of an associated matrix $H_\rho$. Note in particular that, if $\mathcal{H}_p \cap \mathcal{S}_p^c \ne \emptyset$, then such a $\rho_+$ exists. Figure 7, commented on later, provides an example where a $\rho_+$ is found among the other isolated eigenvalues of $L$ (emphasized there in blue). Note that $\rho_+$ only depends on a weighted sum of the $\operatorname{tr} C_i^2$ and may even exist when $M = 0$, $t = 0$, and $T = 0$. Intuitively, this already suggests that $\rho_+$ is only marginally related to the spectral clustering problem.
Along with the eigenvalue n and the eigenvalues identified in Theorem 3, this enumerates all the (asymptotic) isolated eigenvalues of L.
The two theorems above exhibit quite involved expressions that do not easily allow for intuitive interpretations. We shall see in Section 5 that these results greatly simplify in some specific scenarios of practical interest. We may nonetheless already extrapolate some elementary properties.
Remark 6 (Large and small eigenvalues of $D_{\tau,\rho}$). If an eigenvalue of $D_{\tau,\rho}$ diverges to infinity as $n, p \to \infty$, by the boundedness property of Stieltjes transforms, $h(\tau, \rho)$ and $\Gamma_\rho$ remain bounded and, thus, the value $\rho$ cancelling the determinant of $G_\rho$ must go to infinity as well. This is the expected behavior of spiked models. This implies in particular that, if, for some $i, j$, the corresponding entries of $D_{\tau,\rho}$ diverge slowly as $n, p \to \infty$ (thus departing from our assumptions), there will exist an asymptotically unbounded eigenvalue in the spectrum of $L'$ (aside from the eigenvalue $n$). Conversely, if for all $i, j$ those quantities vanish as $n, p \to \infty$, then $D_{\tau,z}$ is essentially zero in the limit, and thus, aside from the $\rho$'s solving $h(\tau, \rho) = 0$ (and from the eigenvalue $n$), no isolated eigenvalue can be found in the spectrum of $L$.
Remark 7 (About the Kernel). As a confirmation of the intuition captured in Remark 2, it now clearly appears from Theorem 3 that, as $f'(\tau) = 0$, the matrix $M$ does not contribute to the isolated eigenvectors of $L$, and thus the $\mu_i$'s can be anticipated to play no role in the resulting spectral clustering methods. Similarly, from Theorem 2, if $f''(\tau) = 0$, the cross-variances $\frac{1}{p} \operatorname{tr} C_i^\circ C_j^\circ$ do not intervene and thus cannot be discriminated over. Finally, letting $\frac{5 f'(\tau)}{8 f(\tau)} = \frac{f''(\tau)}{2 f'(\tau)}$ discards the impact of the traces $\frac{1}{\sqrt{p}} \operatorname{tr} C_i^\circ$. This has interesting consequences in practice if one aims at discriminating data upon some specific properties.

Eigenvectors
Let us now turn to the central aspect of the article: the eigenvectors of $L$ (which are the same as those of $L'$, up to reordering).
To start with, note that, in both theorems, the eigenvalue $n$ is associated with the eigenvector $D^{\frac12} 1_n$. Since this eigenvector is completely explicit (which shall not be the case for the next eigenvectors), it is fairly easy to study independently, without resorting to any random matrix analysis. Precisely, we have the following result for it.
Remark 8 (Precisions on $D^{\frac12} 1_n$). The value of $\varphi$ can be made explicit: recalling the definition (3.4) of $\psi$, an explicit asymptotic expression for $\varphi$ holds almost surely.
Note that the eigenvector $D^{\frac12} 1_n$ may then be used directly for clustering, with increased efficiency when the entries $t_a = \frac{1}{\sqrt{p}} \operatorname{tr} C_a^\circ$ of $t$ grow large for fixed $\frac{1}{p} \operatorname{tr} C_a^2$. But this eigenvector (asymptotically) carries no information concerning $M$ or $T$, and is in particular of no use if all covariances $C_i$ have the same trace.
For the other isolated eigenvectors, the study is much more delicate, as we do not have an explicit expression as in Proposition 1. Instead, by statistical interchangeability of the class-$\mathcal{C}_a$ entries of, say, the $i$-th isolated eigenvector $\hat{u}_i$ of $L$, we may write
$$\hat{u}_i = \sum_{a=1}^k \alpha_a^i \frac{j_a}{\sqrt{n_a}} + \sigma_a^i \omega_a^i \qquad (4.1)$$
where $\omega_a^i \in \mathbb{R}^n$ is a random vector, orthogonal to $j_a$, of unit norm, supported on the indices of $\mathcal{C}_a$, with identically distributed entries. The scalars $\alpha_a^i \in \mathbb{R}$ and $\sigma_a^i \ge 0$ are respectively the coefficient of alignment to $j_a$ and the standard deviation of the fluctuations around $\alpha_a^i \frac{j_a}{\sqrt{n_a}}$. Assuming when needed unit multiplicity for the eigenvalue associated with $\hat{u}_i$, our objective is now twofold: 1. Class-wise Eigenvector Means. We first wish to retrieve the values of the $\alpha_a^i$'s. For this, note that $\frac{1}{\sqrt{n_a}} j_a^\top \hat{u}_i = \alpha_a^i$. We shall evaluate these quantities by obtaining an estimator for the $k \times k$ matrix $J^\top \hat{u}_i \hat{u}_i^\top J$ (appropriately normalized). The diagonal entries of the latter will allow us to retrieve $|\alpha_a^i|$, and the off-diagonal entries will be used to decide on the signs of $\alpha_1^i, \ldots, \alpha_k^i$ (up to a convention in the sign of $\hat{u}_i$).

2. Class-wise Eigenvector Inner and Cross Fluctuations. Our second objective is to evaluate the cross-moments $\sigma_a^{i,j}$ between the fluctuations of two eigenvectors indexed by $i$ and $j$ on the subblock indexing $\mathcal{C}_a$. In particular, letting $i = j$, $\sigma_a^{i,i} = (\sigma_a^i)^2$ from the previous definition (4.1). For this, it is sufficient to exploit the previous estimates and to evaluate the quantities $\hat{u}_i^\top \mathcal{D}(j_a) \hat{u}_j$. But, to this end, for lack of a better approach, we shall resort to estimating the more involved object $\frac{1}{p} J^\top \hat{u}_i \hat{u}_i^\top \mathcal{D}(j_a) \hat{u}_j \hat{u}_j^\top J$. The two aforementioned steps are successively derived in the next sections, starting with the evaluation of the coefficients $\alpha_a^i$, for which an empirical counterpart is sketched below.
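When ground-truth labels are available (as in simulations), the empirical counterparts of $\alpha_a^i$ and $\sigma_a^i$ are directly accessible (our sketch):

```python
import numpy as np

def classwise_stats(u, labels, k):
    """Empirical alpha_a = j_a^T u / sqrt(n_a) (alignment with the class
    indicator) and sigma_a = norm of the residual fluctuation of the
    eigenvector u within each class a, as in the decomposition (4.1)."""
    alphas, sigmas = np.zeros(k), np.zeros(k)
    for a in range(k):
        ua = u[labels == a]
        na = len(ua)
        alphas[a] = ua.sum() / np.sqrt(na)          # j_a^T u / sqrt(n_a)
        sigmas[a] = np.linalg.norm(ua - ua.mean())  # fluctuation around the mean
    return alphas, sigmas
```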

Eigenvector means ($\alpha_a^i$)
Consider the case where $f'(\tau)$ is away from zero. First observe that, for $\lambda_j^p, \ldots, \lambda_{j+m_\rho-1}^p$ a group of the identified isolated eigenvalues of $L$ all converging to the same limit (as per Theorem 2 or Remark 4), the corresponding eigenspace is (asymptotically) the same as the eigenspace associated with the corresponding deterministic eigenvalue $\rho$ in the spectrum of $P W^\top W P + U B U^\top$. Denoting $\hat{\Pi}_\rho$ the projector on the former eigenspace, we then have, by Cauchy's formula and our approximation of Theorem 1, for a small (positively oriented) closed path $\gamma_\rho$ circling around $\rho$,
$$\hat{\Pi}_\rho = -\frac{1}{2\pi \imath} \oint_{\gamma_\rho} \left( \hat{L}' - z I_n \right)^{-1} dz + o(1), \qquad (4.2)$$
this being valid for all large $n$, almost surely. Using matrix inversion lemmas, the right-hand side of (4.2) can be worked out and reduced to an expression involving the matrix $G_z$ of Theorem 2. It then remains to perform a residue calculus on the final formulation, which leads to the following result.
Theorem 4 (Eigenvector projections, $f'(\tau)$ away from zero). Under the setting of Theorem 2, an explicit asymptotic expression for $\frac{1}{n} J^\top \hat{\Pi}_\rho J$ holds, with $V_{r,\rho} \in \mathbb{C}^{k \times m_\rho}$ and $V_{l,\rho} \in \mathbb{C}^{k \times m_\rho}$ respectively sets of right and left eigenvectors of $G_\rho$ associated with the eigenvalue zero, and $G_\rho'$ the derivative of $G_z$ with respect to $z$ evaluated at $z = \rho$.
Similarly, when $f'(\tau) \to 0$, we obtain, with the same limiting approach as for Theorem 3, the following estimate.
Theorem 5 (Eigenvector projections, $f'(\tau) \to 0$). The analogous expression holds with $V_{r,\rho}^0 \in \mathbb{C}^{k \times m_\rho}$ and $V_{l,\rho}^0 \in \mathbb{C}^{k \times m_\rho}$ sets of right and left eigenvectors of $G_\rho^0$ associated with the eigenvalue zero, and $G_\rho^{0\prime}$ the derivative of $G_z^0$ with respect to $z$ evaluated at $z = \rho$.
In correspondence with Remarks 4 and 5, we have the following complementary result for the isolated eigenvalues satisfying $h(\tau, \rho) \to 0$.
Remark 9 (Irrelevance of the eigenvectors with $h(\tau, \rho) \to 0$). In addition to Theorem 4, it can be shown (see the proof section) that, if $\rho_+$ is an isolated eigenvalue as per Remark 4 having multiplicity one, then, with similar notations as above, $\frac{1}{n} J^\top \hat{\Pi}_{\rho_+} J \to 0$ almost surely. The same holds for $\rho_+^0$ from Remark 5. As such, as far as spectral clustering is concerned, the eigenvectors forming the (dimension one) eigenspace $\hat{\Pi}_{\rho_+}$ cannot be used for unsupervised classification.
From this remark, we shall from now on adopt the following convention. The finitely many isolated eigenvalue-eigenvector pairs $(\rho_i, \hat{u}_i)$ of $L$, excluding those for which (at least on a subsequence) $h(\tau, \rho) \to 0$, will be denoted in the order $\rho_1 \ge \rho_2 \ge \ldots$, with possibly equal values of the $\rho_i$'s to account for multiplicity.
A few further remarks are in order.
Remark 10 (Evaluation of $\alpha_a^i$). Let us consider the case where $\rho$ is an isolated eigenvalue of unit multiplicity with associated eigenvector $\hat{u}_i$ (thus $\hat{\Pi}_\rho = \hat{u}_i \hat{u}_i^\top$). According to (4.1), we may obtain the expressions of $\alpha_1^i, \ldots, \alpha_k^i$ as follows:
1. $\frac{1}{n_1} [J^\top \hat{u}_i \hat{u}_i^\top J]_{11} = |\alpha_1^i|^2$ allows the retrieval of $\alpha_1^i$ up to a sign shift. We may conventionally call this nonnegative value $\alpha_1^i$ (if it turns out to be zero, we may proceed similarly with entry $(2,2)$ instead).
2. For all $1 < a \le k$, $\frac{1}{\sqrt{n_1 n_a}} [J^\top \hat{u}_i \hat{u}_i^\top J]_{1a} = \alpha_1^i \alpha_a^i$ provides $\alpha_a^i$ up to a sign shift consistent with the aforementioned convention, and thus we may redefine each $\alpha_a^i$ accordingly.
Remark 11 (Eigenspace alignment). An alignment metric between the span of $\hat{\Pi}_\rho$ and the sought-for subspace $\operatorname{span}(j_1, \ldots, j_k)$ may be given by $\frac{1}{n} \operatorname{tr}\left( \mathcal{D}(c^{-1}) J^\top \hat{\Pi}_\rho J \right)$, with $(c^{-1})_i = 1/c_i$, and corresponds in practice to the extent to which the eigenvectors of $L$ are close to linear combinations of the base vectors $\frac{j_1}{\sqrt{n_1}}, \ldots, \frac{j_k}{\sqrt{n_k}}$.
Remark 11 may be straightforwardly applied to observe a peculiar phenomenon (of fundamental practical reach) when $M = 0$, $t = 0$, and the kernel is such that $f'(\tau) \to 0$.
Remark 12 (Only $T$ case and $f'(\tau) \to 0$). If $M = 0$, $t = 0$, and $f'(\tau) \to 0$, the matrix $G_\rho^0$ takes a particularly simple form, from which the alignment metric of Remark 11 is seen to reach its maximal value almost surely. It shall also be seen through an example in Section 5 that this is no longer true in general if $f'(\tau)$ is away from zero. As such, there is an asymptotically perfect alignment in the regime under consideration if only $T$ is non-vanishing, provided one takes $f'(\tau) \to 0$. In this case, it is theoretically possible, as $n, p \to \infty$, to correctly cluster all but a vanishing proportion of the $x_i$.
Remark 12 is somewhat unsettling at first look, as it suggests the possibility of trivial clustering by setting $f'(\tau) = 0$, under some specific statistical conditions. This stands in stark contradiction with Assumption 1, which was precisely laid out so as to avoid trivial behaviors. As a matter of fact, a thorough investigation of the conditions of Remark 12 was recently performed in our follow-up work [12], where it is shown that clustering becomes non-trivial under a growth rate on $T$ slower than that allowed by Assumption 1. This conclusion, supported by conclusive simulations, explicitly says that classical clustering methods (based for instance on the Gaussian kernel) necessarily fail in the regime where $T = O(p^{-\frac14})$, while by taking $f'(\tau) = 0$ non-trivial clustering is achievable. This observation is used to provide a novel subspace clustering algorithm, with applications in particular to wireless communications. See [12] for more details.

Eigenvector fluctuations $(\sigma_a^i)^2$ and cross fluctuations $\sigma_a^{i,j}$
Let us now turn to the evaluation of the fluctuations and cross-fluctuations of the isolated eigenvectors of $L$ around their projections onto $\operatorname{span}(j_1, \ldots, j_k)$. As far as inner fluctuations are concerned, first remark that, from Proposition 1, we already know the class-wise fluctuations of the eigenvector $D^{\frac12} 1_n$ (these are proportional to $\frac{1}{p} \operatorname{tr} C_a^2$ for class $\mathcal{C}_a$), and thus we may simply work on the remaining eigenvectors. We are then left with evaluating (i) the inner and cross fluctuations involving eigenvectors $\hat{u}_i, \hat{u}_j$, for $i, j > 1$ ($i$ may equal $j$), and (ii) the cross fluctuations between $\hat{u}_1$ (that is, $(1_n^\top D 1_n)^{-\frac12} D^{\frac12} 1_n$) and $\hat{u}_i$, $i > 1$. For readability, from now on, we shall use the shortcut notation $D_a \triangleq \mathcal{D}(j_a)$. For case (i), to estimate $\sigma_a^{i,j}$ we need to evaluate $\hat{u}_i^\top D_a \hat{u}_j$. However, $\hat{u}_i^\top D_a \hat{u}_j$ may not be directly estimated using a mere Cauchy integral approach as previously (unless $i = j$, for which alternative approaches exist). To work around this, we propose instead to estimate the object
$$\frac{1}{p} J^\top \hat{u}_i \hat{u}_i^\top D_a \hat{u}_j \hat{u}_j^\top J. \qquad (4.3)$$
Indeed, if $\hat{u}_i$ has a non-trivial projection onto $\operatorname{span}(j_1, \ldots, j_k)$ (in the other case, $\hat{u}_i$ is of no interest to clustering), then there exists at least one index $a \in \{1, \ldots, k\}$ for which $\frac{1}{p} j_a^\top \hat{u}_i$ (proportional to $\alpha_a^i$) is non-zero. The same holds for $\hat{u}_j$, and thus $\hat{u}_i^\top D_a \hat{u}_j$ can be retrieved by dividing a specific entry $(m, l)$ of (4.3) by the appropriate $\alpha_m^i$ and $\alpha_l^j$. Our approach to evaluate (4.3) is to operate a double contour integration (two applications of Cauchy's formula) by which, for $\rho, \tilde{\rho}$ two distinct isolated eigenvalues, the quantity of interest is expressed through products of resolvents $\bar{Q}_z \triangleq (P W^\top W P + U B U^\top - z I_n)^{-1}$, with obvious notations.
For case (ii), note from Proposition 1 that, at first order, $\hat{u}_1$ is essentially the vector $\frac{1_n}{\sqrt{n}}$ of norm 1 plus fluctuations of norm $O(n^{-\frac12})$. If one were to evaluate $\hat{u}_1^\top D_a \hat{u}_i$, $i > 1$, as previously done, this would thus provide an inaccurate estimate of the cross-fluctuations. However, since the fluctuating part of $\hat{u}_1$ is well understood by Remark 8, and is in particular directly related to $\psi$, defined in (3.4), we shall here estimate instead $\psi^\top D_a \hat{u}_i$, which can be obtained from $\frac{1}{\sqrt{p}} \psi^\top D_a \hat{u}_i \hat{u}_i^\top J$, the latter being in turn obtained, in case of unit multiplicity, from a contour integral similar to (4.3). Before presenting our results, we need an additional technical result, mostly borrowed from our companion article [3].

Lemma 2 (Further Deterministic Equivalents).
Under the conditions and notations of Lemma 1, for $z_1, z_2$ (sequences of) complex numbers at macroscopic distance from the eigenvalues of $P W^\top W P$, as $n \to \infty$, deterministic equivalents hold for the bilinear forms involving the product $Q_{z_1}(\cdot) Q_{z_2}$ (their explicit expressions are those of [3]). In particular, as a consequence of Lemma 2, we have explicit asymptotic identities, almost surely, expressed through a matrix $R_{z_1 z_2}$. Remark 13 (About $R_{z_1 z_2}$). The matrix $R_{z_1 z_2}$ is strongly related to the derivatives of the $g_a(z)$'s; in particular, $R_{zz}$ can be expressed in terms of the $g_a'(z)$'s. With this lemma at hand, we are in position to introduce the following class-wise fluctuation results.

Theorem 6 (Eigenvector fluctuations, $f'(\tau)$ away from zero). Under the setting and notations of Theorem 4, as $n \to \infty$, the class-wise fluctuations and cross-fluctuations admit explicit limits expressed through $R_{z_1 z_2}$.
Theorem 7 (Eigenvector fluctuations, $f'(\tau) \to 0$). Under the setting and notations of Theorem 5, as $n \to \infty$, the analogous expressions hold with $R_{z_1 z_2}$ replaced by its counterpart $R_{z_1 z_2}^0$. These results are much more involved than those previously obtained and do not lend themselves to much insight. Nonetheless, we shall see in Section 5 that they greatly simplify when considering special application cases.
Remark 14 (Relation to $(\sigma_a^i)^2$ and $\sigma_a^{i,j}$). For $i, j > 1$, from the convention on the signs of $\hat{u}_i$ and $\hat{u}_j$ given by Remark 10, the quantities $\sigma_a^{i,j}$ follow immediately. As for the case $i = 1$, $j > 1$, corresponding to $\hat{u}_1 = (1_n^\top D 1_n)^{-\frac12} D^{\frac12} 1_n$, we may similarly impose the convention that $1_n^\top \hat{u}_1 > 0$ (for all large $n$), which is easily ensured as $\hat{u}_1 = \frac{1_n}{\sqrt{n}} + o(1)$ almost surely. The corresponding expressions then hold for all $d \in \{1, \ldots, k\}$ for which $\alpha_d^j \ne 0$. An illustration of the application of the results of Sections 4.2.1 and 4.2.2 to determine the class-wise means, fluctuations, and cross-fluctuations of the eigenvectors, as discussed at the early stage of Section 4.2, is depicted in Figure 5. There, under the same setting as in Figure 4 (that is, three classes with various means and covariances under a Gaussian kernel), we display as class-wise colored crosses the $n$ pairs $([\hat{u}_i]_m, [\hat{u}_j]_m)$, $m = 1, \ldots, n$, for the $i$-th and $j$-th dominant eigenvectors $\hat{u}_i$ and $\hat{u}_j$ of $L$. The left figure is for $(\hat{u}_1, \hat{u}_2)$ and the right figure for $(\hat{u}_2, \hat{u}_3)$. On top of these values are drawn the ellipses corresponding to the one- and two-standard deviations of the fluctuations, obtained from the theoretical class-wise covariance matrices. In order to get a precise understanding of the behavior of spectral clustering based on $L$, we shall successively constrain the conditions of Assumption 1 so as to obtain meaningful results.
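For reproduction purposes, the empirical analogue of these deviation ellipses can be computed as follows (our sketch; the blue curves of Figures 3 and 5 use the theoretical moments instead of the empirical ones):

```python
import numpy as np

def deviation_ellipse(pairs, n_std=2, n_pts=100):
    """Given the 2D points ([u_i]_m, [u_j]_m) of one class, return the
    n_std-standard-deviation ellipse of their empirical distribution
    as an (n_pts x 2) array of boundary points."""
    mean = pairs.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(pairs.T))   # principal axes
    vals = np.maximum(vals, 0.0)                   # guard against round-off
    theta = np.linspace(0, 2 * np.pi, n_pts)
    circle = np.stack([np.cos(theta), np.sin(theta)])
    return mean + (vecs @ (n_std * np.sqrt(vals)[:, None] * circle)).T
```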

Special cases
The results obtained in Section 4 above are particularly difficult to analyze, for lack of tractability of the functions $g_1(z), \ldots, g_k(z)$ (even though, from a numerical point of view, these can be accurately estimated as the output of a provably converging fixed point algorithm; see [3]). In this section, we shall consider three scenarios for which all $g_a(z)$ assume a common value:
1. We first assume $C_1 = \cdots = C_k$. In this setting, only $M$ can be discriminated over for spectral clustering. We shall show that, in this case, up to $k - 1$ isolated eigenvalues can be found in-between each pair of connected components in the limiting support of the empirical spectrum of $P W^\top W P$, in addition to the isolated eigenvalue $n$. But more importantly, we shall show that, as long as $f'(\tau)$ is away from zero, the kernel choice is asymptotically irrelevant (so one may for instance take $f(x) = x$ with the same asymptotic performance).
2. Next, we shall consider $\mu_1 = \cdots = \mu_k$ and take $C_a = (1 + p^{-1/2} \gamma_a) C$ for some fixed $\gamma_1, \ldots, \gamma_k$. This ensures that $T$ vanishes and thus only $t$ can be used for clustering. There we shall surprisingly observe that a maximum of two isolated eigenvalues altogether can be found in the limiting spectrum, and that $\hat{u}_1$ (associated with the eigenvalue $n$) and the hypothetical $\hat{u}_2$ are extremely correlated. This indicates that clustering can here be asymptotically performed based solely on $\hat{u}_1$.
3. Finally, to underline the effect of $T$ alone, we shall enforce a model in which $\mu_1 = \cdots = \mu_k$, $n_1 = \cdots = n_k = n/k$, and $C_a$ is of the form $\mathcal{D}(D_1, \ldots, D_1, D_2, D_1, \ldots, D_1)$ with $D_2$ in the $a$-th position. There we shall observe that the hypothetical isolated eigenvalues have multiplicity $k - 1$, thus precluding a detailed eigenvector analysis.

Case $C_a = C$ for all $a$.
Assume that, for all $a$, $C_a = C$ (a condition we may further relax by merely requiring that $t \to 0$ and $T \to 0$ in the large $p$ limit). For simplicity of exposition, we shall additionally require the following.

Assumption 3 (Spectral convergence of C).
As $p \to \infty$, the empirical spectral measure $\frac{1}{p} \sum_{i=1}^p \delta_{\lambda_i(C)}$ converges weakly to $\nu$. This assumption ensures, along with the results from [30] and [1], that the sequences $g_1(z), \ldots, g_k(z)$ all converge towards a unique $g(z)$, solution to the implicit equation
$$g(z) = \left( -z + c_0 \int \frac{u \, \nu(du)}{1 + u \, g(z)} \right)^{-1}. \qquad (5.1)$$
This is the Stieltjes transform of a probability measure $\bar{P}$ having continuous density and support $S$. Besides, as $n, p \to \infty$, none of the eigenvalues of $P W^\top W P$ are found away from $S \cup \{0\}$.
In this setting, only the matrix $M$ can be discriminated upon to perform clustering, and thus we need to take here $f'(\tau)$ away from zero to obtain meaningful results, which, since $\tau \to 2 \int u \, \nu(du)$, merely requires that $f'(2 \int u \, \nu(du)) \ne 0$. Starting then from Theorem 2 and using the fact that $M c = 0$, the values of $\rho \notin S' = S \cup \{\rho;\, h(\tau, \rho) = 0\}$ for which $G_\rho$ (asymptotically) has zero eigenvalues are, by Sylvester's identity, the solutions of a determinant equation in $g(\rho)$. Hence, these $\rho$'s are such that $-g(\rho)^{-1}$ coincides with one of the eigenvalues of $C + M \mathcal{D}(c) M^\top$. But from [24,30], the restriction of $-1/g$ to $\mathbb{R}^+$ has an image excluding the support of $\nu$; more precisely, this image is characterized by the function $x(g) = -\frac{1}{g} + c_0 \int \frac{u \, \nu(du)}{1 + u g}$, a visual representation of which is provided in Figure 6. Now, since $M c = 0$, $M$ has maximum rank $k - 1$, and thus, by Weyl's interlacing inequalities, at most $k - 1$ eigenvalues of $C + M \mathcal{D}(c) M^\top$ may isolate from each connected component of the support of $\nu$. Additionally, since $t = 0$, as per Remark 4, if there exists $\rho_+$ away from $S$ such that $h(\tau, \rho_+) = 0$ (a condition involving $\int u^2 \nu(du)$), then an additional isolated eigenvalue of $L$ may be found.
Gathering the above, we thus have the following corollary of Theorem 2.

Corollary 1 (Eigenvalues for $C_a = C$). Let Assumptions 1-3 hold, denote $S$ the support of $\nu$, and adopt the natural indexing convention for the eigenvalues $\ell_j^i$ of $C + M \mathcal{D}(c) M^\top$ isolated within each connected component of $\mathbb{R} \setminus S$. Then, for each such eigenvalue meeting the separability condition, there exists an isolated eigenvalue $\rho_j^i$ of $L$, asymptotically well approximated through the function $x(\cdot)$ evaluated at the corresponding $\ell_j^i$. Besides, if there exists $\rho_+$ away from $S$ with $h(\tau, \rho_+) = 0$, there is an additional corresponding isolated eigenvalue in the spectrum of $L$ (carrying the factor $-2 f'(\tau)$). These, along with $n$, characterize all the isolated eigenvalues of $L$.
As a further corollary, for $C = \beta I_p$, we obtain an explicit threshold on the isolated eigenvalues of $M \mathcal{D}(c) M^\top$, which is a classical separability condition in standard spiked models [22,2,27]. Before interpreting this result, let us next characterize the eigenvectors associated with these eigenvalues. The eigenvector attached to the eigenvalue $n$ is $D^{\frac12} 1_n$; it has already been analyzed and carries no information, since $t = 0$.
We also know that the hypothetical eigenvalue $\rho_+$ does not carry any relevant clustering information. We are then left with studying the eigenvalues $\rho_j^i$. For those (assuming they remain distant from one another), the projection $\frac{1}{n} J^\top \hat{\Pi}_\rho J$ admits an explicit limit almost surely, where $\rho = \rho_j^i$ is understood to be any of the $\rho_j^i$ eigenvalues from Corollary 1. It is easier here to use the fact that $\bar{G}_\rho \triangleq G_\rho \mathcal{D}(c^{-1})$ is symmetric, with $\mathcal{D}(c^{-1}) V_{r,\rho} = V_{l,\rho}$ as eigenvectors for its $m_\rho$ zero eigenvalues.
Regarding fluctuations, note that the leftmost inverse matrix $(I_k - \Omega_{z_1 z_2})^{-1}$ in the definition of $R_{z_1 z_2}$ (Lemma 2) is merely a rank-one perturbation of the identity matrix, so that, after basic algebraic manipulations, a fully explicit expression is obtained. A careful (but straightforward) development of all the terms in Theorem 6, using the remarks already made, then allows one to obtain the following spectral clustering analysis for the setting under present concern.
Corollary 2 (Spectral Clustering for $C_a = C$). For $i \in \{0, \ldots, s+1\}$, $j \in \{1, \ldots, k_i\}$, let $\rho_j^i$ and $\ell_j^i$ be as defined in Corollary 1, assumed of unit multiplicity, with associated eigenvector $\hat{u}_j^i$ of $L$. Under the model (4.1), written here $\hat{u}_j^i = \sum_a (\alpha_j^i)_a \frac{j_a}{\sqrt{n_a}} + (\sigma_j^i)_a (\omega_j^i)_a$ with $(\omega_j^i)_a \in \mathbb{R}^n$ supported by $\mathcal{C}_a$, of unit norm and orthogonal to $j_a$, explicit expressions for the class-wise means and fluctuations follow. It is interesting to note, from Corollaries 1 and 2, that all expressions of the relevant quantities in this setting are completely explicit. This allows for easy interpretation of the results. Quite a few remarks can in particular be made.
Remark 15 (Asymptotic kernel irrelevance). From the results of Corollaries 1 and 2, it appears that, aside from the hypothetical isolated eigenvalue $\rho_+$ (whose eigenvector carries in any case no information), when $C_a = C$ for all $a \in \{1, \ldots, k\}$, the choice of the kernel function $f$ is asymptotically of no avail, so long as $f'(\tau)$ remains away from zero. This can be interpreted in practice by the fact that, since the data $x_i$ are linearly separable and no difference aside from location statistics can be exploited to discriminate them, the so-called "kernel trick", which projects the data onto a high dimensional space to improve separability, does not provide any additional gain for clustering.
Remark 16 (On the possibility to cluster). Even if $\nu$ had unbounded support, (5.2) allows isolated eigenvalue-eigenvector pairs to emerge in-between successive bulks of eigenvalues and to carry relevant clustering information, although then necessarily with imperfect alignment to $j_1, \ldots, j_k$. This departs from the standard assumption that only extreme eigenvectors may be exploited for spectral clustering.
Remark 17 (On the interplay between $C$ and $M$). All quantities obtained in Corollary 2 rely on $\Upsilon_{\rho_j^i}$, which is an "isolated" eigenvector of $C + M \mathcal{D}(c) M^\top$. Also, and possibly more importantly, since distinct such eigenvectors are orthogonal when $C = \beta I_p$, we find that $(\sigma_{j\tilde{j}}^{i\tilde{i}})_a = o(1)$ for each $a = 1, \ldots, k$. Therefore, the fluctuating parts of the isolated eigenvectors of $L$ are asymptotically uncorrelated when $C = \beta I_p$.
Remark 18 (On the existence of useless eigenvectors). Corollary 1 predicts the possibility that $h(\tau, \rho) = 0$ for some $\rho$, the associated eigenvector of which does not align to $j_1, \ldots, j_k$. In Figure 7, we present such a scenario in a $k$-class setting for which $k + 1$ (and not only $k$) isolated eigenvalues of $L$ are found. We depict in parallel the corresponding eigenvectors. As expected, the eigenvector $D^{\frac12} 1_n$ does not visually align to $j_1, \ldots, j_k$. For the other three, note that eigenvectors 2 and 4 do align to $j_1, \ldots, j_k$ and are thus the sought-for (at most) $k - 1$ eigenvectors in this connected component of $\mathbb{R} \setminus S'$. As for eigenvector 3, it shows no strong alignment to $j_1, \ldots, j_k$, and must therefore arise from the solution $h(\tau, \rho_+) = 0$.

Case $\mu_a = \mu$ and $C_a = (1 + p^{-1/2} \gamma_a) C$ for all $a$.
Consider now the somewhat opposite scenario where $\mu_1 = \cdots = \mu_k$ and $C_a = (1 + p^{-1/2} \gamma_a) C$, $a \in \{1, \ldots, k\}$, for some fixed $\gamma_1, \ldots, \gamma_k \in \mathbb{R}$. We shall further denote $\gamma = (\gamma_1, \ldots, \gamma_k)^\top$ and $\gamma^2 = (\gamma_1^2, \ldots, \gamma_k^2)^\top$. Similarly to before, we place ourselves for simplicity under Assumption 3. Although not necessary, it shall also be simpler to assume $C^\circ = C$ (which is always possible up to modifying $C, \gamma_1, \ldots, \gamma_k$). In this case, both scenarios, $f'(\tau)$ away from zero or converging to zero, are of interest. Let us focus first on the former, where $G_z$ reduces to a rank-one perturbation of a scaled identity matrix, with $g(z)$ given by the implicit equation (5.1). Letting $z = \rho \in \mathbb{R} \setminus S$, the roots of the right-hand matrix are the solutions (if any) of an explicit scalar equation. It is easily checked that the former candidate value, corresponding to $h(\tau, \rho) = 0$, does not bring a zero eigenvalue in $H_\rho$ and thus, as per Remark 4, does not map to an isolated eigenvalue of $L$.
Assuming now $f'(\tau) \to 0$, a similar derivation leads either to a solution corresponding to $h^0(\tau, \rho^0) = 0$, hence not retained, or to a second explicit solution. These results can then be gathered as follows.
This value, along with $n$, accounts for all the isolated eigenvalues of $L$. Otherwise, if $f'(\tau) \to 0$, then the non-zero isolated spectrum of $L$ is composed of the eigenvalue $n$ and of one further explicit eigenvalue. As for the eigenvector projections, assuming first $f'(\tau)$ away from zero, note that, as $G_\rho$ is a rank-one perturbation of a scaled identity matrix, the left and right eigenvectors $V_{l,\rho}, V_{r,\rho} \in \mathbb{R}^k$ are respectively proportional to $t$ and $\mathcal{D}(c) t$; the projection formulas follow, and similarly for $f'(\tau) \to 0$. The result of Corollary 3 is quite surprising when compared to Corollary 1. Indeed, while the latter allowed for up to $k - 1$ isolated eigenvalues to be found outside $S'$, here a maximum of one such eigenvalue is available, irrespective of $C$. This state of affairs is obviously linked to $t t^\top$ being of unit rank, while $M M^\top$ can be of rank up to $k - 1$. For practical purposes, there being no information diversity, the clustering task is increasingly difficult to achieve as $k$ increases. But this becomes even worse when considering the cross-correlations between $D^{\frac12} 1_n$ and the (hypothetical) eigenvector $\hat{u}$ associated with $\rho$. Precisely, after some calculus, we obtain the counterpart of Corollary 2 as follows.
with $\omega_a^i \in \mathbb{R}^n$ supported by $\mathcal{C}_a$, of unit norm and orthogonal to $j_a$. Then, for $f'(\tau)$ away from zero, we find explicit expressions for the class-wise means and (cross-)fluctuations, in the notations of Corollary 3. If $f'(\tau) \to 0$, the results are unchanged except for some of the terms involving $c_0$. This leads us to the following important remark.
Remark 19 (Irrelevance of $\hat{u}_2$). From the expressions in Corollary 3, for sufficiently large values of $c^\top \gamma^2$, a first order development gives $(\sigma_a^{12})^2 (\sigma_a^1)^{-2} (\sigma_a^2)^{-2} \simeq 1$. Since this is the equality case of the Cauchy-Schwarz inequality, we deduce that the eigenvectors $\hat{u}_1$ and $\hat{u}_2$ tend to have similar fluctuations. Besides, the useful parts of $(\alpha^1)_a$ and $(\alpha^2)_a$ are directly proportional to $c_a \gamma_a$, thus making $\hat{u}_1$ and $\hat{u}_2$ eventually quite similar. Very little information is thus expected to be further extracted from $\hat{u}_2$ beyond clustering based on $\hat{u}_1$. This remark is even more valid when considering $f'(\tau) \to 0$, for which the discussion holds true even for small $c^\top \gamma^2$. Figure 8 provides an illustration of Remark 19, by visually comparing the results of Sections 5.3 and 5.2: for the same choice of kernel (tailored so that both cases exhibit the required number of informative eigenvectors), the distribution of the eigenvectors in the present scenario is quite concentrated along a one-dimensional direction, whereas the former scenario of different $\mu_a$'s exhibits a well-scattered distribution on the two-dimensional plane.

Case $\mu_a = \mu$, $\operatorname{tr} C_a$ constant
We now study the effect of the matrix $T$ alone (thus imposing $\mu_1 = \cdots = \mu_k$ and $\operatorname{tr} C_1 = \cdots = \operatorname{tr} C_k$). Although it is possible to analyze such a scenario completely (at least for $k = 2$), it shall be simpler here to enforce $g_1 = \cdots = g_k$. To this end, we shall assume the symmetric model $C_a = \mathcal{D}(1_{a-1} \otimes D_1, D_2, 1_{k-a} \otimes D_1)$, for $a = 1, \ldots, k$, and for some nonnegative definite $D_1, D_2 \in \mathbb{R}^{p/k \times p/k}$. We also suppose that $n_1 = \cdots = n_k$. A formulation of Assumption 3 adapted to the present setting also comes in handy.
Assumption 4 (Spectral convergence of $(k-1) D_1 + D_2$). As $p \to \infty$, the empirical spectral measure $\frac{1}{p} \sum_{i=1}^p \delta_{\lambda_i((k-1) D_1 + D_2)}$ converges weakly to $\nu$. This setting ensures that all the $g_i$'s are identical and asymptotically equivalent, under Assumption 4, to an implicitly defined $g(z)$, to which we may associate the inverse $x(g)$ as in (5.7). In this scenario, $M = 0$ and $t = 0$, so that only $T$ can be used to perform data clustering. Letting first $f'(\tau)$ be away from zero, it is rather immediate to apply Theorem 2. The case $f'(\tau) \to 0$ is handled similarly.
If instead $f'(\tau) = 0$, denote $\rho^0$ the corresponding solutions; then these, along with $n$, form the asymptotic isolated spectrum of $L$.
In terms of eigenvectors, the result is also quite immediate from the form of $G_z$, and we have the following result.

Corollary 6 (Eigenspace projections, constant mean and trace). Under the assumptions and notations of Corollary 5, and under the associated separability condition, the eigenvector projections admit explicit limiting expressions.
From Corollary 6, we then obtain, for $f'(\tau) \neq 0$, a total squared projection onto $\mathrm{span}(j_1, \ldots, j_k)$ which is all the closer to $k-1$ that $f'(\tau)/f''(\tau)$ is small or that $\frac{1}{p}\operatorname{tr}(D_1 - D_2)^2$ is large (for a given $\operatorname{tr}((k-1)D_1 + D_2)$). For $f'(\tau) = 0$, this squared Frobenius norm is maximal, hence a perfect alignment to $j_1, \ldots, j_k$, consistently with Remark 12.
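The perfect-alignment phenomenon for $f'(\tau) = 0$ can be probed numerically. The sketch below uses an ad hoc quadratic kernel enforcing $f(\tau) = 1$, $f'(\tau) = 0$, $f''(\tau) > 0$ (our choice, not a kernel advocated here) on a two-class version of the symmetric covariance model, and measures the correlation between the top informative eigenvectors and the class labels.

```python
# Hedged illustration of the f'(tau) = 0 effect: equal means, equal traces, covariances
# differing only in their block structure; a kernel with vanishing first derivative at
# tau should yield eigenvectors strongly aligned with the class indicators.
import numpy as np

rng = np.random.default_rng(1)
p, n_per, k = 512, 128, 2
n, m = n_per * k, p // k
diags = [np.r_[np.full(m, 1.5), np.full(m, 0.5)],   # class 1: blocks (1.5 I, 0.5 I)
         np.r_[np.full(m, 0.5), np.full(m, 1.5)]]   # class 2: blocks (0.5 I, 1.5 I)
X = np.concatenate([rng.standard_normal((n_per, p)) * np.sqrt(d) for d in diags])

norms = (X ** 2).sum(1)
sq = (norms[:, None] + norms[None, :] - 2 * X @ X.T) / p
tau_hat = np.median(sq[~np.eye(n, dtype=bool)])     # consistent estimate of tau
K = 1.0 + 2.0 * (sq - tau_hat) ** 2                 # f(t) = 1 + 2 (t - tau)^2
d_deg = K.sum(1)
L = n * K / np.sqrt(np.outer(d_deg, d_deg))

w, V = np.linalg.eigh(L)
labels = np.repeat([0.0, 1.0], n_per)
aligns = [abs(np.corrcoef(V[:, j], labels)[0, 1]) for j in range(n - 4, n - 1)]
print("best alignment among top informative eigenvectors:", max(aligns))
```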

Practical considerations
Echoing Remarks 2 and 7, one important practical outcome of our study lies in the observation that clustering can be performed selectively upon $M$, $t$, or $T$ by properly choosing the kernel function $f$. Assume for instance the following hierarchical scenario, in which a superclass $\mathcal C_i$ is identified via a constant mean $\mu_i$ but divides into subclasses of covariances $C_{i,j}$, $j = 1, \ldots, i_k$, for some $i_k$. If the objective is to discriminate only the superclasses and not each individual subclass, then one may design $f$ in such a way that both $f''(\tau)$ and $\frac{5f'(\tau)}{8f(\tau)} - \frac{f''(\tau)}{2f'(\tau)}$ are sufficiently small. Alternatively, if differences in means appear due to an improper centering of the data while the relevant information is carried by the covariance structure, then it is useful to take $f'(\tau) = 0$. This may be useful in image classification for images with different lightness and contrast. These considerations however assume the possibility to tailor the function $f$ according to the values taken by its successive derivatives at $\tau$. To ensure this is possible, note that, under Assumption 1, $\frac{1}{p}\|x_i - x_j\|^2 \xrightarrow{\rm a.s.} \tau$ uniformly over $i \neq j$, and it is thus possible to consistently estimate $\tau$ and, hence, to design $f$ to one's purposes.
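A minimal sketch of this estimation step, assuming only access to the raw data matrix (the median over off-diagonal pairs is our robustness choice; any consistent statistic of the pairwise distances would do):

```python
# Minimal sketch: consistent estimation of tau from the concentration of
# ||x_i - x_j||^2 / p around tau, using the median over off-diagonal pairs.
import numpy as np

def estimate_tau(X):
    """X: (n, p) data matrix; returns the median of ||x_i - x_j||^2 / p, i != j."""
    n, p = X.shape
    norms = (X ** 2).sum(1)
    sq = (norms[:, None] + norms[None, :] - 2 * X @ X.T) / p
    return np.median(sq[~np.eye(n, dtype=bool)])

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 1000))        # C = I_p, so tau = 2 tr(C) / p = 2
print("tau_hat =", estimate_tau(X))         # close to 2
```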
If such an explicit design of $f$ is not clear at the onset (from the data themselves or from the sought objective), one may alternatively run several instances of kernel spectral clustering for various values of the triple $(f(\tau), f'(\tau), f''(\tau))$ spanning $\mathbb R^3$. Explicit comparisons can then be made between the obtained classes by computing a score (such as the RatioCut score obtained from (1.1)) and selecting as ultimate clustering the one reaching the best score. This provides a disruptive approach to kernel design in spectral clustering, as it in particular allows for non-decreasing kernel functions.
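The following hedged sketch illustrates this scan for $k = 2$: each triple $(f(\tau), f'(\tau), f''(\tau))$ is realized by a quadratic kernel, the partition is read off the sign of the second dominant eigenvector, and candidates are ranked by a RatioCut score. The grids, the quadratic parameterization, and the fixed reference affinity used for fair scoring across kernels are all our illustrative choices.

```python
# Hedged sketch of the scan over derivative triples: f(t) = c0 + c1 (t - tau)
# + (c2/2)(t - tau)^2 realizes (f, f', f'') = (c0, c1, c2) at tau.
import numpy as np
from itertools import product

def ratio_cut(K, labels):
    """RatioCut score of a 2-way partition under affinity K (lower is better)."""
    s = 0.0
    for a in (0, 1):
        in_a = labels == a
        s += K[in_a][:, ~in_a].sum() / max(in_a.sum(), 1)
    return s

rng = np.random.default_rng(7)
p, n_per = 300, 100
n = 2 * n_per
mu = np.zeros(p); mu[0] = 4.0                         # illustrative mean separation
X = np.concatenate([rng.standard_normal((n_per, p)),
                    rng.standard_normal((n_per, p)) + mu])
norms = (X ** 2).sum(1)
sq = (norms[:, None] + norms[None, :] - 2 * X @ X.T) / p
tau_hat = np.median(sq[~np.eye(n, dtype=bool)])
K_ref = np.exp(-sq / 2)                               # fixed reference kernel for scoring

best = None
for c0, c1, c2 in product([1.0], [-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0]):
    K = c0 + c1 * (sq - tau_hat) + 0.5 * c2 * (sq - tau_hat) ** 2
    d = np.abs(K.sum(1))                              # abs() guards sign-flipped degrees
    L = n * K / np.sqrt(np.outer(d, d))
    _, V = np.linalg.eigh(L)
    labels = (V[:, -2] > 0).astype(int)               # sign split on the 2nd eigenvector
    if 0 < labels.sum() < n:                          # keep non-degenerate partitions
        s = ratio_cut(K_ref, labels)
        if best is None or s < best[0]:
            best = (s, (c0, c1, c2))
if best is not None:
    print("selected (f, f', f'') at tau:", best[1], f"RatioCut {best[0]:.2f}")
```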
A further practical consideration arises when selecting a kernel function that may engender negative or large positive values (such as with polynomial kernels). While theoretically valid on Gaussian inputs, for robustness reasons in practical scenarios it seems appropriate to rather consider more stable families of kernels, such as generalized Gaussian kernels of the type $f(t) = a\exp(-b(t-c)^2)$, which can be tuned to meet the derivative constraints.
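Such a tuning admits a closed form. Writing $f'/f = -2b(t-c)$ and $f''/f = -2b + 4b^2(t-c)^2$, the targets $(f(\tau), f'(\tau), f''(\tau))$ determine $(a, b, c)$ whenever $f(\tau) > 0$ and $(f'(\tau)/f(\tau))^2 > f''(\tau)/f(\tau)$, so that $b > 0$. A minimal sketch (the function name is ours):

```python
# Minimal sketch: fit f(t) = a exp(-b (t - c)^2) to prescribed values
# (F0, F1, F2) = (f(tau), f'(tau), f''(tau)), using the closed form derived from
# f'/f = -2b(t - c) and f''/f = -2b + 4 b^2 (t - c)^2.
import numpy as np

def fit_generalized_gaussian(tau, F0, F1, F2):
    r1, r2 = F1 / F0, F2 / F0
    b = (r1 ** 2 - r2) / 2
    if F0 <= 0 or b <= 0:
        raise ValueError("targets not reachable with b > 0")
    u = -r1 / (2 * b)               # u = tau - c
    return F0 * np.exp(b * u ** 2), b, tau - u   # (a, b, c)

tau = 2.0
a, b, c = fit_generalized_gaussian(tau, F0=1.0, F1=-0.5, F2=0.1)
f = lambda t: a * np.exp(-b * (t - c) ** 2)
eps = 1e-5                           # finite-difference check of the three constraints
print(f(tau),
      (f(tau + eps) - f(tau - eps)) / (2 * eps),
      (f(tau + eps) - 2 * f(tau) + f(tau - eps)) / eps ** 2)   # ~ (1.0, -0.5, 0.1)
```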

Extension to non-Gaussian settings
The present work strongly relies (mostly for mathematical tractability) on the Gaussian assumption on the data $x_i$. Nonetheless, as is often the case in random matrix theory, the results can be generalized to some extent beyond this assumption. In particular, assume now that, for $x_i$ in class $\mathcal C_a$, we take $x_i = \mu_a + C_a^{\frac12} z_i$, where $z_i$ is a random vector with independent zero mean, unit variance entries $z_{ij}$ having at least finite kurtosis $\kappa \equiv \mathrm E[z_{ij}^4] - 3$. Then our results may be generalized by noticing that, for $x_i$ in class $\mathcal C_a$, the fluctuations of $\|x_i - \mu_a\|^2$ are now governed by $2\operatorname{tr} C_a^2 + \kappa \operatorname{tr} \operatorname{diag}(C_a)^2$, with $\operatorname{diag}(C)$ the operator which sets to zero all off-diagonal elements of $C$. Since $\kappa \geq -2$ and $\operatorname{tr}\operatorname{diag}(C_a)^2 \leq \operatorname{tr} C_a^2$, this term can take any nonnegative value, which shall impact (positively or negatively) the detectability. In particular, this will impact the performance of both scenarios where $M = 0$ studied in Section 5.
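The role of the kurtosis can be checked directly on the quadratic form $z^{\mathsf T} C z$, whose variance equals $2\operatorname{tr} C^2 + \kappa \operatorname{tr}\operatorname{diag}(C)^2$ for independent entries; Rademacher entries attain the extreme case $\kappa = -2$. A small hedged experiment (diagonal $C$, illustrative sizes):

```python
# Hedged check of Var(z^T C z) = 2 tr C^2 + kappa * tr diag(C)^2: for a diagonal C,
# Rademacher entries (kappa = -2) suppress the fluctuations entirely.
import numpy as np

rng = np.random.default_rng(3)
p, trials = 500, 2000
C = np.linspace(0.5, 1.5, p)                       # diagonal covariance, as a vector

for name, sampler, kappa in [
    ("gaussian", lambda s: rng.standard_normal(s), 0.0),
    ("rademacher", lambda s: rng.choice([-1.0, 1.0], size=s), -2.0),
]:
    Z = sampler((trials, p))
    q = (Z ** 2 * C).sum(1)                        # z^T C z for diagonal C
    predicted = 2 * (C ** 2).sum() + kappa * (C ** 2).sum()
    print(f"{name:10s} empirical var {q.var():8.1f}  predicted {predicted:8.1f}")
```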
Moving to more general statistical models requires completely reworking the proofs of all theorems. An interesting choice for the law of $x_i$, modelling heavy-tailed distributions, is that of elliptical laws with location and scatter matrices depending on the class. These have been widely studied in the recent random matrix literature [16,15]. In this case, $x_i$ can be written under the form $\mu_a + \sqrt{t_i}\, C_a^{\frac12} z_i$, with $z_i$ uniformly distributed on the $p$-dimensional sphere and $t_i > 0$ a scalar independent of $z_i$. By a similar concentration of measure argument, it is expected that $\frac1p\|x_i - x_j\|^2$ shall now converge to $(t_i + t_j)\tau/2$ instead of $\tau$, that is, to a limit varying with the pair $(i,j)$. This implies that the kernel function $f$ will be exploited beyond its sole value at $\tau$, opening a wide scope of theoretical and applied investigations. Such considerations are left to future work.

On the growth rate
To better understand our data setting, it is interesting to recall the Ng-Jordan-Weiss intuition spelled out in Section 1. In a perfectly discriminable setting, and for an appropriate choice of $f$, $L$ would have $k$ eigenvalues equal to $n$ with associated eigenspace the span of $\{j_1, \ldots, j_k\}$, while all other eigenvalues would typically remain of order $O(1)$. As the data become less discriminable, $k-1$ of these eigenvalues become smaller, with associated eigenvectors only partially aligned to $\{j_1, \ldots, j_k\}$. This remains valid until the $k-1$ eigenvalues become so small that they merge with the remaining spectrum. Our study is therefore positioned at the cluster detectability limit, beyond which clustering becomes (asymptotically) theoretically infeasible. Theorem 2 thus provides the necessary and sufficient conditions for asymptotic detectability of classes in the Gaussian data setting, which are made explicit in Section 5 for several simple scenarios.
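The merging phenomenon is easy to visualize numerically. In the hedged sketch below (illustrative sizes and kernel), two Gaussian classes have their means progressively separated, and the gap between the second and third largest eigenvalues of $L$ signals when detection becomes possible.

```python
# Hedged illustration of the detectability limit: as the mean separation grows, an
# informative eigenvalue detaches from the bulk of L's spectrum.
import numpy as np

rng = np.random.default_rng(4)
p, n_per = 400, 200
n = 2 * n_per
for delta in [0.0, 1.0, 2.0, 4.0]:                   # ||mu_1 - mu_2|| = delta
    mu = np.zeros(p); mu[0] = delta
    X = np.concatenate([rng.standard_normal((n_per, p)),
                        rng.standard_normal((n_per, p)) + mu])
    norms = (X ** 2).sum(1)
    sq = (norms[:, None] + norms[None, :] - 2 * X @ X.T) / p
    K = np.exp(-sq / 2)
    d = K.sum(1)
    L = n * K / np.sqrt(np.outer(d, d))
    w = np.sort(np.linalg.eigvalsh(L))
    print(f"delta={delta}: 2nd / 3rd largest eigenvalues "
          f"{w[-2]:.3f} / {w[-3]:.3f}")              # a growing gap signals detection
```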
However, the above discussion on detectability assumes $f$ fixed from the outset. As it turns out from our results, it is often beneficial to take $f(\tau)/f'(\tau)$ and $f'(\tau)/f''(\tau)$ as small as possible (see for instance how taking $f'(\tau) = 0$ benefits the alignment to $j_1, \ldots, j_k$). Taking $f$ to be the classically used Gaussian kernel $f(x) = \exp(-\frac{x}{2\sigma^2})$, this implies that $\sigma^2$ should be taken small. In turn, this suggests that a more appropriate growth regime is one where $\sigma^2$ is not fixed but vanishes with $n$, and thus that $f$ should adapt to $n$. Another remark following the same suggestion is that the Taylor expansion of $K_{ij}$ performed to obtain Theorem 1 naturally discards all subsequent derivatives of $f$, along with all higher order properties of the data (i.e., only $M$, $t$, and $T$ can be discriminated upon). Allowing $f$ to adapt to $n$ would allow for a more flexible analysis. In particular, since $\|x_i - x_j\|^2$ has $O(\sqrt p)$ fluctuations around $\tau p$, we may expect that taking $f(x) = \bar f((x - \tau)\sqrt p)$, for some fixed function $\bar f$, would provide a wider dynamic range to the kernel function. This setting, which is quite challenging to study as it does not lend itself to classical random matrix analysis, remains open.
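As a quick check of this heuristic, note that for the Gaussian kernel the two ratios coincide:
\[
f(x) = e^{-\frac{x}{2\sigma^2}} \;\Longrightarrow\; f'(x) = -\frac{1}{2\sigma^2} f(x), \quad f''(x) = \frac{1}{4\sigma^4} f(x), \quad \frac{f(\tau)}{f'(\tau)} = \frac{f'(\tau)}{f''(\tau)} = -2\sigma^2,
\]
so that both ratios indeed vanish together as $\sigma^2 \to 0$.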

Preliminary definitions and remarks
As we will regularly deal with uniform convergence of the entries of vectors of growing size, we shall call upon the union bound over many instances. To this end, we need the following definitions, borrowed from e.g. [6, Def. 2.5]. For $x \equiv x_n$ a random variable and $u_n \geq 0$, we write $x = O(u_n)$ if, for any $\eta > 0$ and $D > 0$, $n^D \mathrm P(x \geq n^\eta u_n) \to 0$ as $n \to \infty$. Unless otherwise specified, when $x$ depends, besides the implicit parameter $n$, on other parameters (if $x = x_{ij}$ for example), the convergences will always be supposed uniform in these other parameters. As a consequence of this convention, this $O(\cdot)$ notation, besides being compatible with sums and products, has the property that the maximum of a collection of at most $n^C$ random variables of order $O(u_n)$ is still $O(u_n)$, for any constant $C$.
As far as multidimensional objects are concerned, let us make the notation $O(\cdot)$ a bit more precise. Note that definition c) below may seem quite surprising, since $\|\mathbf 1_n\| = \sqrt n$, but it is in fact perfectly adapted to our context, where some matrix error terms are, thanks to classical probabilistic phenomena, of smaller order when observed through the vector $\mathbf 1_n$ than through their maximal eigenvector.
Note also that, for $M = [m_{ij}]_{i,j=1}^n$ a random matrix, since $\|M\|^2 \leq \operatorname{tr} MM^*$, operator norm controls follow from entry-wise ones. In the remainder of the proof, to make our expressions shorter and more readable, we shall regularly use the following conventions. Matrices will be indexed by blocks according to the partition $\mathcal C_1, \ldots, \mathcal C_k$, with $(a,b)$ referring to block $a$ in rows and block $b$ in columns. It is then easy to see, using for example Lemmas 3 and 4, that one may Taylor expand the entries of $K$ around $\tau^\circ$ and control, in operator norm, the order of each resulting matrix term.

Case $f'(\tau^\circ)$ away from zero
As a consequence of the above, under the assumption that $f'(\tau^\circ) \neq 0$, we may write the approximation of $K$ by means of an $n \times (2k+4)$ matrix $V$ and of a matrix $A \equiv A_n + A_{\sqrt n} + A_1$, with $A_n$, $A_{\sqrt n}$, and $A_1$ symmetric.
The division of $A$ into $A_n$, $A_{\sqrt n}$, and $A_1$ reflects the fact that $A_n$ has operator norm $O(n)$, $A_{\sqrt n}$ has operator norm $O(\sqrt n)$, while $A_1$ is of order $O(1)$. It is already interesting to note that, from the result above, $K$ is asymptotically equivalent to a kind of spiked model, in the sense that $VAV^{\mathsf T}$ is of finite rank and is added to $PW^{\mathsf T}WP$, which, under reasonable conditions, does not exhibit spikes by itself.
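A generic toy illustration of this spiked structure (arbitrary constants, not the matrices $V$, $A$ of the expansion above): a Marchenko-Pastur-type bulk plus a finite-rank additive perturbation produces isolated eigenvalues beyond the bulk edge.

```python
# Hedged toy illustration: Wishart-type bulk + rank-one perturbation produces an
# isolated eigenvalue, mirroring the spiked decomposition of K described above.
import numpy as np

rng = np.random.default_rng(5)
n, p = 600, 900
W = rng.standard_normal((p, n)) / np.sqrt(p)
bulk = W.T @ W                                  # Marchenko-Pastur-type bulk
v = np.ones((n, 1)) / np.sqrt(n)
spiked = bulk + 4.0 * (v @ v.T)                 # finite-rank (here rank-one) spike
w = np.sort(np.linalg.eigvalsh(spiked))
edge = (1 + np.sqrt(n / p)) ** 2                # right edge of the MP bulk
print(f"bulk edge ~ {edge:.2f}, top eigenvalues: {w[-3:]}")
```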

Taylor expansion of $D^{-\frac12}$
We next address the expansion of $n^{-1}D = n^{-1}\mathcal D(K\mathbf 1_n)$. For this, using Equation (7.3) and the convention for $O_{\mathbf 1_n}(\cdot)$, we expand $K\mathbf 1_n$ term by term. We shall importantly use in what follows the fact that $P\mathbf 1_n = 0$, which helps discard quite a few terms (and which is the main motivation for centering the $x_i$ and $w_i$ in the first place). Let us first provide estimates for the quantities involving $A$ and $V$. In particular, applying $\mathbf 1_n^{\mathsf T}$ to these estimates gives, for each block pair $(a,b)$, terms of order $n_a n_b$ combining $(t_a + t_b)^2$ and $\frac{2}{p^2}\operatorname{tr}(C_a + C_b)^2$, weighted by ratios of the derivatives of $f$ at $\tau^\circ$. Finally, recalling that $WW^{\mathsf T} = \frac{1}{p}\sum_{a=1}^k C_a^{\frac12} Z_a Z_a^{\mathsf T} C_a^{\frac12}$, with the $C_a$ bounded and the $Z_a$ standard Gaussian and independent across $a$, we get from [1] a control of the operator norm of $WW^{\mathsf T}$. Getting back to $n^{-1}D$ with the above estimates and identifying the leading order term, a further Taylor expansion then provides that of $D^{-\frac12}$ (in the resulting expressions, $D^2$ stands for the squared diagonal matrix).
Taylor expansion of L

With the Taylor expansions of $K$ and $D^{-\frac12}$ in hand, we may now obtain the Taylor expansion of their left- and right-products, so as to retrieve that of $L$. Using the sub-multiplicativity of the operator norm, this follows precisely from (7.3) and (7.5), and the same strategy provides the remaining products.

Taylor expansion of L'

At this point, it is worth mentioning that, although hardly fathomable from the expression above, computer simulations suggest that the spectrum of $L = nD^{-\frac12}KD^{-\frac12}$ is composed of an isolated eigenvalue of magnitude $O(n)$ and, importantly, of $n-1$ eigenvalues of order $O(1)$. Nonetheless, surprisingly at first, the above approximation of $L$ still contains terms of order $O(\sqrt n)$; this is explained by the fact that those terms result from the Taylor expansion of the (one-dimensional) leading eigenspace. Fortunately, we precisely know the leading eigenvector of $L$ to be $D^{\frac12}\mathbf 1_n$, and we may thus project $L$ orthogonally to it without affecting the eigenvalue-eigenvector pairs, but for this single isolated one. This will allow us to retrieve a matrix, $L'$, the eigenvalues of which are expected to be all of order $O(1)$.
Let us start by evaluating the vector $D^{\frac12}\mathbf 1_n$ (that is, the diagonal of $D^{\frac12}$ turned into a vector). Applying $\mathbf 1_n^{\mathsf T}$ and $\mathbf 1_n$ on each side of the expansion of $D$ (or, alternatively, taking the squared norm of the above), we obtain in particular the expansions of $\mathbf 1_n^{\mathsf T}D\mathbf 1_n$ and of $n\,\frac{D^{\frac12}\mathbf 1_n \mathbf 1_n^{\mathsf T} D^{\frac12}}{\mathbf 1_n^{\mathsf T}D\mathbf 1_n}$. From the above and the previously derived expression of $L = nD^{-\frac12}KD^{-\frac12}$, we finally retrieve the expression of $L' = nD^{-\frac12}KD^{-\frac12} - n\,\frac{D^{\frac12}\mathbf 1_n \mathbf 1_n^{\mathsf T} D^{\frac12}}{\mathbf 1_n^{\mathsf T}D\mathbf 1_n}$. Before proceeding to the full calculus, let us focus on the terms of operator norm of order $n$ and $\sqrt n$. The only term of order $n$ arises from $-2f(\tau)VA_nV^{\mathsf T} = f(\tau)\mathbf 1_n\mathbf 1_n^{\mathsf T}$, which is present in both $L$ and $n\,\frac{D^{\frac12}\mathbf 1_n \mathbf 1_n^{\mathsf T} D^{\frac12}}{\mathbf 1_n^{\mathsf T}D\mathbf 1_n}$.
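The claim on the leading pair of $L$ can be verified exactly in simulation, since $L D^{\frac12}\mathbf 1_n = n D^{\frac12}\mathbf 1_n$ holds by construction of $D = \mathcal D(K\mathbf 1_n)$. A minimal check (illustrative sizes):

```python
# Hedged numerical check: L = n D^{-1/2} K D^{-1/2} has one eigenvalue exactly equal
# to n, with eigenvector D^{1/2} 1_n, the remaining eigenvalues staying O(1).
import numpy as np

rng = np.random.default_rng(6)
p, n = 300, 200
X = rng.standard_normal((n, p))
norms = (X ** 2).sum(1)
sq = (norms[:, None] + norms[None, :] - 2 * X @ X.T) / p
K = np.exp(-sq / 2)
d = K.sum(1)                                     # D = D(K 1_n), as a vector
L = n * K / np.sqrt(np.outer(d, d))

w, V = np.linalg.eigh(L)
ref = np.sqrt(d) / np.linalg.norm(np.sqrt(d))    # D^{1/2} 1_n, normalized
print(f"top eigenvalue: {w[-1]:.1f} (n = {n}); "
      f"alignment |<u, D^{{1/2}}1_n>| = {abs(V[:, -1] @ ref):.6f}; "
      f"next eigenvalue: {w[-2]:.3f}")
```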