Spectral clustering based on local linear approximations

In the context of clustering, we assume a generative model where each cluster is the result of sampling points in the neighborhood of an embedded smooth surface; the sample may be contaminated with outliers, which are modeled as points sampled in space away from the clusters. We consider a prototype for a higher-order spectral clustering method based on the residual from a local linear approximation. We obtain theoretical guarantees for this algorithm and show that, in terms of both separation and robustness to outliers, it outperforms the standard spectral clustering algorithm (based on pairwise distances) of Ng, Jordan and Weiss (NIPS '01). The optimal choice for some of the tuning parameters depends on the dimension and thickness of the clusters. We provide estimators that come close enough for our theoretical purposes. We also discuss the cases of clusters of mixed dimensions and of clusters that are generated from smoother surfaces. In our experiments, this algorithm is shown to outperform pairwise spectral clustering on both simulated and real data.


Introduction
In a number of modern applications, the data appear to cluster near some low-dimensional structures. In the particular setting of manifold learning [51,47,7,22,17], the data are assumed to lie near manifolds embedded in Euclidean space. When multiple manifolds are present, the foremost task is separating them, meaning the recovery of the different components of the data associated with the different manifolds. Manifold clustering naturally occurs in the human visual cortex, which excels at grouping points into clusters of various shapes [41,21]. It is also relevant for a number of modern applications. For example, in cosmology, galaxies seem to cluster forming various geometric structures such as one-dimensional filaments and two-dimensional walls [52,39]. In motion segmentation, feature vectors extracted from moving objects and tracked along different views cluster along affine or algebraic surfaces [35,23,53,10]. In face recognition, images of faces in fixed pose under varying illumination conditions cluster near low-dimensional affine subspaces [31,6,19], or along low-dimensional manifolds when introducing additional poses and camera views.
In the last few years several algorithms for multi-manifold clustering were introduced; we discuss them individually in Section 1.3.3. We focus here on spectral clustering methods, and in particular, study a prototypical multiway method relying on local linear approximations, with precursors appearing in [12,1,48,2,27]. We refer to this method as Higher-Order Spectral Clustering (HOSC). We establish theoretical guarantees for this method within a standard mathematical framework for multi-manifold clustering. Compared with all other algorithms we are aware of, HOSC is able to separate clusters that are much closer together; equivalently, HOSC is accurate under much lower sampling rate than any other algorithm we know of. Roughly speaking, a typical algorithm for multi-manifold clustering relies on local characteristics of the point cloud in a way that presupposes that all points, or at least the vast majority of the points, in a (small enough) neighborhood are from a single cluster, except in places like intersections of clusters. In contrast, though HOSC is also a local method, it can work with neighborhoods where two or more clusters coexist.

Higher-Order Spectral Clustering (HOSC)
We introduce our higher-order spectral clustering algorithm in this section, tracing its origins to the spectral clustering algorithm of Ng et al. [42] and the spectral curvature clustering of Chen and Lerman [12,11].
Spectral methods are based on building a neighborhood graph on the data points and partitioning the graph using its Laplacian [22,34], which is closely related to the extraction of connected components. The version introduced by Ng et al. [42] is an emblematic example; we refer to this approach as SC. It uses an affinity based on pairwise distances: given a scale parameter ε > 0 and a kernel φ, the affinity between the ith and jth points is W_ij = α(x_i, x_j) := φ(‖x_i − x_j‖/ε), yielding the similarity matrix W. With D the diagonal matrix of degrees D_i = Σ_j W_ij, SC forms Z := D^{−1/2} W D^{−1/2}; note that I − Z is the corresponding normalized Laplacian. Providing the algorithm with the number of clusters K, SC continues by extracting the top K eigenvectors of Z, obtaining a matrix U ∈ R^{N×K}, and after normalizing its rows, uses them to embed the data into R^K. The algorithm concludes by applying K-means to the embedded points. See Algorithm 1 for a summary.
Algorithm 1 Spectral Clustering (SC) [42]
Input: x_1, x_2, ..., x_N: the data points; ε: the affinity scale; K: the number of clusters
Output: A partition of the data into K disjoint clusters
Steps:
1: Compute the affinity matrix W = (W_ij), with W_ij = α(x_i, x_j).
2: Compute Z = (Z_ij) := D^{−1/2} W D^{−1/2}, where D is the diagonal matrix of degrees D_i = Σ_j W_ij.
3: Extract U = [u_1, ..., u_K], the top K eigenvectors of Z.
4: Renormalize each row of U to have unit norm, obtaining a matrix V.
5: Apply K-means to the row vectors of V in R^K to find K clusters.
6: Accordingly group the original points into K disjoint clusters.
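The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation studied in the paper: it uses the simple kernel and a bare-bones farthest-point-initialized K-means standing in for a library routine.

```python
import numpy as np

def spectral_clustering(X, eps, K, n_iter=100):
    """Sketch of SC (Algorithm 1) with the simple kernel phi(s) = 1{|s| < 1};
    `eps` is the affinity scale."""
    N = X.shape[0]
    # Step 1: pairwise affinities W_ij = phi(||x_i - x_j|| / eps)
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    W = (dists / eps < 1).astype(float)
    # Step 2: Z = D^{-1/2} W D^{-1/2}, with D the diagonal matrix of degrees
    deg = W.sum(axis=1)
    Z = W / np.sqrt(np.outer(deg, deg))
    # Step 3: top K eigenvectors of the symmetric matrix Z
    _, vecs = np.linalg.eigh(Z)        # eigenvalues in ascending order
    U = vecs[:, -K:]
    # Step 4: renormalize each row of U to unit length
    V = U / np.linalg.norm(U, axis=1, keepdims=True)
    # Step 5: bare-bones K-means (farthest-point init + Lloyd iterations)
    centers = [V[0]]
    for _ in range(K - 1):
        d = np.min([np.linalg.norm(V - c, axis=1) for c in centers], axis=0)
        centers.append(V[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if (labels == k).any():
                centers[k] = V[labels == k].mean(axis=0)
    return labels
```

On two well-separated groups, the top K eigenvectors of Z are (rotations of) cluster indicators, so the embedded rows collapse to K distinct points and K-means recovers the partition.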
Spectral methods utilizing multiway affinities were proposed to better exploit additional structure present in the data. The spectral curvature clustering (SCC) algorithm of Chen and Lerman [12,11] was designed for the case of hybrid linear modeling, where the manifolds are assumed to be affine, a setting that arises in motion segmentation [35]. Assuming that the subspaces are all of dimension d, a parameter of the algorithm, SCC starts by computing the (polar) curvature of all (d + 2)-tuples, creating a tensor of order d + 2 and dimension N. The tensor is then flattened into a matrix A whose product with its transpose, W = AA^T, is used as an affinity matrix for the spectral algorithm SC. (In practice, the algorithm is randomized for computational tractability.) Kernel spectral curvature clustering (KSCC) [10] is a kernel version of SCC designed for the case of algebraic surfaces.
The SCC algorithm (and therefore KSCC) is not localized in space as it fits a parametric model that is global in nature. The method we study here may be seen as a localization of SCC, which is appropriate in our nonparametric setting since the manifolds resemble affine surfaces locally. This type of approach is mentioned in publications on affinity tensors [1,48,2,27] and is studied here for the first time, to our knowledge. As discussed in Section 4, all reasonable variants have similar theoretical properties, so that we choose one of the simplest versions to ease the exposition. Concretely, we consider a multiway affinity that combines pairwise distances between nearest neighbors and the residual from the best d-dimensional local linear approximation. Formally, given a set of m ≥ d + 2 points x_1, ..., x_m, define
Λ_d(x_1, ..., x_m) := min_{S ∈ A_d} max_{i=1,...,m} dist(x_i, S),
where dist(x, S) := inf_{s ∈ S} ‖x − s‖ for a subset S ⊂ R^D and A_d denotes the set of d-dimensional affine subspaces in R^D. In other words, Λ_d(x_1, ..., x_m) is the width of the thinnest tube (or band) around a d-dimensional affine subspace that contains x_1, ..., x_m. (In our implementation, we use the mean-square error; see Section 3.) Given scale parameters ε > η > 0 and a kernel function φ, define the following affinity: α_d(x_1, ..., x_m) = 0 if x_1, ..., x_m are not distinct, and otherwise
α_d(x_1, ..., x_m) = φ(diam(x_1, ..., x_m)/ε) · φ(Λ_d(x_1, ..., x_m)/η),
where diam(x_1, ..., x_m) is the diameter of {x_1, ..., x_m}. See Figure 1 for an illustration. The circle is of radius ε/2 and the band is of half-width η. Assuming we use the simple kernel, the m-tuple on the left has affinity α_d equal to one, while the other two m-tuples have affinity equal to zero, the first one for having a diameter exceeding ε and the second one for being 'thicker' than η.
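The quantities Λ_d and α_d are easy to compute for a given tuple. The sketch below uses the mean-square variant of Λ_d (the one the implementation section mentions), obtained from the trailing singular values of the centered tuple; the max-distance version would require an explicit min-max fit.

```python
import numpy as np

def lambda_d(pts, d):
    """Mean-square variant of Lambda_d: root-mean-square distance of the
    points to their best-fitting d-dimensional affine subspace, via SVD."""
    pts = np.asarray(pts, dtype=float)
    centered = pts - pts.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return np.sqrt((s[d:] ** 2).sum() / len(pts))  # residual beyond top d

def alpha_d(pts, d, eps, eta):
    """Multiway affinity with the simple kernel: 1 exactly when the tuple
    has diameter below eps AND fits in an eta-thin d-dimensional band."""
    pts = np.asarray(pts, dtype=float)
    diffs = pts[:, None, :] - pts[None, :, :]
    diam = np.sqrt((diffs ** 2).sum(-1)).max()
    return float(diam < eps and lambda_d(pts, d) < eta)
```

Collinear tuples have Λ_1 ≈ 0 and hence affinity one (for a small enough diameter), while a tuple with one point off the line is rejected by the η test.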
Given data points x_1, ..., x_N and approximation dimension d, we compute all m-way affinities, and then obtain pairwise similarities by clique expansion [2] (note that several other options are possible [12,48,27]): the pairwise similarity between x_i and x_j is obtained by summing the multiway affinities α_d over the m-tuples containing both x_i and x_j. The resulting algorithm, HOSC, takes as input the data and the tuning parameters, and outputs a partition of X. We say that it is 'perfectly accurate' if the output partition coincides with the original partition of X into X_1, ..., X_K. Our main focus is on relating the sample size N and the separation requirement in (5) (in order for HOSC to cluster correctly), and in particular we let τ and δ vary with N. This dependency is left implicit. In contrast, we assume that d, K, ζ are fixed. Also, we assume that d, τ, K are known throughout the paper (except for Section 2.1, where we consider their estimation). Though our setting is already quite general, we discuss some important extensions in Section 4.

We will also consider the situation where outliers may be present in the data. By outliers we mean points that were not sampled near any of the underlying surfaces. We consider a simple model where outliers are points sampled uniformly in (0, 1)^D \ ∪_k B(S_k, δ_0) for some δ_0 > 0, in general different from δ. That is, outliers are at least a distance δ_0 away from the surfaces. We let N_0 denote the number of outliers, while N still denotes the total number of data points, including outliers. See Figure 3 for an illustration.
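Clique expansion can be illustrated by brute force over all m-tuples; this is a toy sketch (the names and the exhaustive enumeration are ours for illustration, and are only viable for tiny N; the paper's complexity discussion in Section 1.4 covers the practical variants).

```python
import itertools
import numpy as np

def multiway_affinity(pts, d, eps, eta):
    """Simple-kernel multiway affinity: 1 iff the tuple has diameter < eps
    and lies in an eta-thin band around a d-dim affine subspace
    (mean-square version of Lambda_d, via SVD)."""
    c = pts - pts.mean(axis=0)
    s = np.linalg.svd(c, compute_uv=False)
    lam = np.sqrt((s[d:] ** 2).sum() / len(pts))
    diam = max(np.linalg.norm(p - q)
               for p, q in itertools.combinations(pts, 2))
    return float(diam < eps and lam < eta)

def clique_expansion_W(X, d, m, eps, eta):
    """Pairwise similarity by clique expansion: W_ij accumulates the
    multiway affinity over every m-tuple containing both i and j."""
    N = len(X)
    W = np.zeros((N, N))
    for tup in itertools.combinations(range(N), m):
        a = multiway_affinity(X[list(tup)], d, eps, eta)
        if a:
            for i, j in itertools.combinations(tup, 2):
                W[i, j] += a
                W[j, i] += a
    return W
```

The resulting W is then fed to the spectral steps exactly as in Algorithm 1.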

Performance of SC
A number of papers analyze SC under generative models similar to ours [3,54,44,40], as well as the closely related method of extracting connected components of the neighborhood graph [3,37,9,36]. The latter necessitates a compactly supported kernel φ and may be implemented via a union-of-balls estimator for the support of the density [16]. Under the weaker (essentially Lipschitz) regularity assumption (6), Arias-Castro [3] shows that SC with a compactly supported kernel is accurate if
δ ≫ sep_N := (log(N)/N)^{1/d} ∨ τ (7)
(a ∨ b denotes the maximum of a and b, and a_N ≫ b_N means that a_N/b_N → ∞ as N → ∞). With the heat kernel, the same result holds up to a √log N multiplicative factor. See also [37,36], which prove a similar result for the method of extracting connected components under stronger regularity assumptions. At the very least, (7) is necessary for the union-of-balls approach and for SC with a compactly supported kernel, because sep_N is the order of magnitude of the largest distance between a point and its closest neighbor from the same cluster [45]. Note that (6) is very natural in the context of clustering, as it prevents S from being too narrow in some places and possibly being confused with two or more disconnected surfaces. And, when C in (6) is large enough and κ is small enough, it is satisfied by any surface S belonging to S^2_d(κ). Indeed, such a surface resembles an affine subspace locally, and (6) is obviously satisfied for an affine surface.
When outliers may be present in the data, as a preprocessing step we identify as outliers data points with low connectivity in the graph with affinity matrix W, and remove these points from the data before proceeding with clustering. (This is done between Steps 1 and 2 in Algorithm 1.) In the context of spectral clustering, this is very natural; see, e.g., [12,37,3]. Using the pairwise affinity (1), outliers are properly identified if δ_0 − τ satisfies the lower bound in (7) and if the sampling is dense enough, specifically [3],
N_k ≫ N^{d/D} log(N) for each cluster k. (8)
When the surfaces are only required to be of Lipschitz regularity as in (6), we are not aware of any method that can even detect the presence of clusters among outliers if the sampling is substantially sparser.

Performance of HOSC
Methods using higher-order affinities are obviously more complex than methods based solely on pairwise affinities. Indeed, HOSC depends on more parameters and is computationally more demanding than SC. One therefore wonders whether this higher level of complexity is justified. We show that HOSC does improve on SC in terms of clustering performance, both in the required separation between clusters and in robustness to outliers. Our main contribution in this paper is to establish a separation requirement for HOSC which is substantially weaker than (7) when the jitter τ is small enough. Specifically, HOSC operates under the separation
δ ≫ (τ ∨ (log(N)/N)^{2/d}) ∧ sep_N, (9)
where a ∧ b denotes the minimum of a and b, and sep_N is the separation required for SC with a compactly supported kernel, defined in (7). This is proved in Theorem 1 of Section 2. In particular, in the jitterless case (i.e., τ = 0), the magnitude of the separation required for HOSC is (roughly) the square of that for SC at the same sample size; equivalently, at a given separation, HOSC requires (roughly) the square root of the sample size needed by SC to correctly identify the clusters.

Figure 4: Left: data. Middle: output from SC. Right: output from HOSC. The sampling is much sparser than in the original paper of Ng et al. [42], which is why SC fails. This figure is part of Figure 13 in Section 3, which displays more numerical experiments.
That HOSC requires less separation than SC is also observed numerically. In Figure 4 we compare the outputs of SC and HOSC on the emblematic example of concentric circles given in [42] (here with three circles). While the former fails completely, the latter is perfectly accurate. Indeed, SC requires that the majority of points in an ε-ball around a given data point come from the cluster containing that point. In contrast, HOSC is able to operate properly in situations where the separation between clusters is so small, or the sampling rate is so low, that any such neighborhood is empty of data points except for the one point at the center. To further illustrate this point, consider the simplest possible setting, consisting of two parallel line segments in dimension D = 2 separated by a distance δ > 0, specifically S_1 := {(t, 0) : t ∈ [0, 1]} and S_2 := {(t, δ) : t ∈ [0, 1]}. Suppose N/2 points are sampled uniformly on each of these line segments. It is well-known that the typical distance between a point on S_k and its nearest neighbor on S_k is of order O(1/N); see [45]. Hence, a method computing local statistics requires neighborhoods of radius at least of order 1/N, for otherwise some neighborhoods are empty. From (9), HOSC is perfectly accurate when δ = (log N)^3/N^2, say. When the separation δ is that small, a typical ball of radius of order 1/N around a data point contains about as many points from S_1 as from S_2 (thus SC cannot work). See Figure 5 for an illustration.

Figure 5: Clustering results obtained by SC (left) and HOSC (right) on a data set of two lines with small separation (δ = 0.005). 100 points are sampled from each line, equally spaced (at a distance 0.01). Note that the inter-point separation within the same cluster is twice as large as the separation between clusters.
In this case, SC cannot separate the two lines correctly, as we have argued. In contrast, HOSC clusters the data perfectly, which again agrees with the theory and our expectations. We also tried increasing the separation δ from 0.005 to 0.025, in which case both SC and HOSC perform correctly.
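The two-lines configuration above (δ = 0.005, spacing 0.01) can be generated directly to verify the claim that pairwise neighborhoods are useless here: every point's nearest neighbor lies on the other line.

```python
import numpy as np

# Two parallel segments separated by delta = 0.005, with 100 equally
# spaced points each (spacing 0.01), as in the two-lines example.
delta, n = 0.005, 100
t = np.arange(n) * 0.01
S1 = np.column_stack([t, np.zeros(n)])
S2 = np.column_stack([t, np.full(n, delta)])
X = np.vstack([S1, S2])
labels = np.array([0] * n + [1] * n)

# nearest neighbor of every point (excluding the point itself)
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
np.fill_diagonal(D, np.inf)
nn = D.argmin(axis=1)

# each point is closer to the opposite line (0.005) than to its
# neighbors on the same line (0.01)
assert (labels[nn] != labels).all()
```

This is exactly the regime where a pairwise method must fail while the local-band test of HOSC still distinguishes the two thin point clouds.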
As a bonus, we also show that HOSC is able to resolve intersections in some (very) special cases, while SC is incapable of that. See Proposition 6 and also Figure 12.
To make HOSC robust to outliers, we do exactly as described above, identifying outliers as data points with low connectivity in the graph with affinity matrix W, this time computed using the multiway affinity (3). The separation and sampling requirements are substantially weaker than (8): specifically, δ_0 − τ is required to satisfy the lower bound in (9), and the sampling must satisfy
N_k ≫ N^{d/(2D−d)} log(N) for each cluster k. (10)
This is established in Proposition 5, and again, we are not aware of any method for detection that is reliable when the sampling is substantially sparser. For example, when τ = 0 and we are clustering curves (d = 1) in the plane (D = 2) (with background outliers), the sampling requirement in (8) is roughly N_k ≫ N^{1/2} log(N), compared with N_k ≫ N^{1/3} log(N) in (10). In Figure 6 below we compare SC and HOSC on outlier detection, using the data in Figure 4 but further corrupted with 33.3% outliers. See also Figure 15 in Section 3, where more outlier-removal experiments are conducted.

Other Methods
We focus on comparing HOSC and SC to make a strong point that higher-order methods may be preferred to simple pairwise methods when the underlying clusters are smooth and the jitter level is small. In fact, we believe that no method suggested in the literature is able to compete with HOSC in terms of separation requirements. We quickly argue why.
The algorithm of Kushnir et al. [32] is multiscale in nature and rather complex, incorporating local information (density, dimension and principal directions) within a soft spectral clustering approach. In the context of semi-supervised learning, Goldberg et al. [25] introduce a spectral clustering method based on local principal component analysis (PCA) to utilize the unlabeled points. Both methods rely on local PCA to estimate the local geometry of the data, and both operate by coarsening the data, eventually applying spectral clustering to a small subset of points acting as representative hubs for the other points in their neighborhoods. They both implicitly require that, for the most part, the vast majority of data points in each neighborhood where the statistics are computed come from a single cluster. Souvenir and Pless [49] suggest an algorithm that starts with ISOMAP and then alternates, in EM fashion, between the cluster assignment and the computation of the distances between points and clusters; this is done in a lower-dimensional Euclidean space using an MDS embedding. Though this iterative method appears very challenging to analyze, it relies on pairwise distances, computed as a preprocessing step to derive the geodesic distances, which implicitly assumes that the points in small enough neighborhoods are from the same manifold. Thus, like the SC algorithm, all these methods effectively rely on neighborhoods where only one cluster dominates. This is strong evidence that their separation requirements are at best similar to that of SC. The methods of Haro et al. [30] and Gionis et al. [24] are based solely on local dimension and density, and are powerless when the underlying manifolds are of the same dimension and sampled more or less uniformly, which is the focus of this paper. The method of Guo et al. [29] relies on minimizing an energy that, just as HOSC, incorporates the diameter and local curvature of m-tuples, with m = 3 for curves and m = 4 for surfaces in 3D; the minimization is combinatorial over the cluster assignment. In principle, this method could be analyzed with the arguments we deploy here. That said, it seems computationally intractable.

Computational Considerations
Thus it appears that HOSC is superior to SC and other methods in terms of separation between clusters and robustness to outliers, when the clusters are smooth and the jitter is small. But is HOSC even computationally tractable?
Assume K and D are fixed. The algorithm starts with building the neighborhood graph (i.e., computing the matrix W). This may be done by brute force in O(m N^m) flops. Clearly, this first step is prohibitive, in particular since we recommend using a (moderately) large m. However, we may restrict computations to points within distance ε, which essentially corresponds to using a compactly supported kernel φ. Hence, we could apply a range search algorithm to reduce computations. Alternatively, at each point we may restrict computations to its ℓ = ω_N log(N) nearest neighbors, with ω_N → ∞, or, in a slightly different fashion, adapt the local scaling method proposed in [56] by replacing ε in α_d(x_{i_1}, ..., x_{i_m}) by (ε_{i_1} ··· ε_{i_m})^{1/m}, where ε_i denotes the distance between x_i and its ℓth nearest neighbor. The reason is that the central condition (12) effectively requires that the degree at each point be of order log(N)^{m−1} (roughly), which is guaranteed if the ℓ nearest neighbors are included in the computations; see [3,36] for rigorous arguments leading to that conclusion. In low dimensions, D = O(log log N), a range search and ℓ-nearest-neighbor search may be computed effectively with kd-trees in O(N poly(log N)) flops. In higher dimensions, it is essential to use methods that adapt to the intrinsic dimensionality of the data; assuming that d is small, the method suggested in [8] has a similar computational complexity. Hence, the (approximate) affinity matrix W can be computed in order O(N poly(log N)) + O(N ℓ^m) flops; assuming m ≤ log(N)/(ω_N log log(N)), this is of order O(N^{1+1/ω_N}). This is within the possible choices for m in Theorem 1.
Assume we use the ℓ-nearest-neighbor approximation to the neighborhood graph, with ℓ = ω_N log(N). Then computing Z may be done in O(N^{1+1/ω_N}) flops, since the affinity matrix W has at most ℓ^m = O(N^{1/ω_N}) non-zero coefficients per row. Extracting the leading K eigenvectors of Z may then be done in O(K N^{1+1/ω_N}) flops, using Lanczos-type algorithms [15]. Thus we may run the ℓ-nearest-neighbor version of HOSC in O(N^{1+1/ω_N}) flops, and it may be shown to perform comparably.
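The nearest-neighbor restriction described above is straightforward to set up; the sketch below uses a brute-force distance matrix for clarity (a kd-tree would replace it in practice), and takes ω_N = 1 purely for illustration, whereas the theory wants ω_N → ∞.

```python
import numpy as np

# Restrict affinity computations to each point's ell nearest neighbors,
# ell ~ omega_N * log(N): each point then contributes O(ell^(m-1))
# candidate m-tuples instead of O(N^(m-1)).
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
N = len(X)
ell = int(np.ceil(np.log(N)))   # omega_N = 1, for illustration only

D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
np.fill_diagonal(D, np.inf)     # a point is not its own neighbor
nbrs = np.argsort(D, axis=1)[:, :ell]   # ell nearest neighbors per point
```

Only tuples drawn from these neighbor lists are ever scored, which is what makes the resulting affinity matrix sparse enough for Lanczos-type eigensolvers.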
We actually implemented the ℓ-nearest-neighbor variant of HOSC and tried it on a number of simulated data sets and a real data set from motion segmentation. The results are presented in Section 3. The code is publicly available online [13].

Content
The rest of the paper is organized as follows. The main theoretical results are in Section 2, where we provide theoretical guarantees for HOSC, including in contexts where outliers are present or the underlying clusters intersect. We emphasize that HOSC is only able to separate intersecting clusters under very stringent assumptions. In the same section we also address the issue of estimating the parameters that need to be provided to HOSC; in theory at least, they may be chosen automatically. In Section 3 we report on numerical experiments with our own implementation of HOSC, involving both simulated and real data. Section 4 discusses a number of important extensions, such as when the surfaces self-intersect or have boundaries, which are excluded from the main discussion for simplicity. We also discuss the case of manifolds of different intrinsic dimensions, suggesting an approach that runs HOSC multiple times with different d. And we describe a kernel version of HOSC that could take advantage of higher degrees of smoothness. Other extensions are also mentioned, including the use of different kernels. The proofs are postponed to the Appendix.

Theoretical Guarantees
Our main result provides conditions under which HOSC is perfectly accurate with probability tending to one in the framework introduced in Section 1.2. Throughout the paper, we state and prove our results when the surfaces have no boundary and for the simple kernel φ(s) = 1{|s| < 1}, for convenience and ease of exposition. We discuss the case of surfaces with boundaries in Section 4.2 and the use of other kernels in Section 4.5.

Theorem 1. Assume that m satisfies (11), that ε and η satisfy (12) and (13), and that (5) holds with δ bounded below as in (14). Under these conditions, when N is large enough, HOSC is perfectly accurate with probability at least 1 − N^{−ρ_N}.
To relate this to the separation requirement stated in the Introduction, the condition (9) is obtained from (14) by choosing ε and η equal to their respective lower bounds in (12) and (13).
We further comment on the theorem. First, the result holds if ρ_N = ρ and ρ is sufficiently large; we state and prove the result with ρ_N → ∞ as a matter of convenience. Also, by (11) and (14), the weakest separation requirement is achieved when m is at least of order slightly less than O(log N), so that ρ_N is of order O(1). However, as discussed in Section 1.4, the algorithm is not computationally tractable unless m = o(log N). This is another reason why we focus on the case where ρ_N → ∞. Regarding the constraints (12)-(13) on ε and η, they are there to guarantee that, with probability tending to one, each cluster is 'strongly' connected in the neighborhood graph. Note that the bound on ε is essentially the same as that required by the pairwise spectral method SC [3,36]. In turn, once each cluster is 'strongly' connected in the graph, clusters are assumed to be separated enough that they are only 'weakly' connected in the graph. The lower bound (14) quantifies the separation required for that to happen. Note that it is specific to the simple kernel; for example, the heat kernel would require a multiplicative factor proportional to √log N. So how does HOSC compare with SC? When the jitter is large enough that τ ≫ (log(N)/N)^{1/d}, we have η ≥ ε and the local linear approximation contribution to (3) does not come into play. In that case, the two algorithms will output the same clustering (see Figure 7 for an example).

Figure 7: Clustering results obtained by SC (left) and HOSC (right) on the data set of Figure 5, but with separation δ = 0.025 and jitter τ = 0.01. In this example, neither SC nor HOSC can successfully separate the two lines. This supports our claim that when the jitter is large enough (relative to the separation), HOSC does not improve over SC, and the two algorithms output the same clustering.
When the jitter is small enough that τ ≪ (log(N)/N)^{1/d}, HOSC requires less separation, as demonstrated in Figure 5. Intuitively, in this regime the clusters are sampled densely enough relative to the thickness τ that the smoothness of the underlying surfaces comes into focus, and each cluster, as a point cloud, becomes locally well-approximated by a thin band. We provide numerical experiments in Section 3 showing HOSC outperforming SC in various settings.
Thus, HOSC improves on SC only when the jitter is small. This condition is quite severe, though, again, we do not know of any other method that can accurately cluster under the weak separation requirement displayed here, even in the jitterless case. It is possible that some form of scan statistic (i.e., matched filters) may be able to operate under the same separation requirement without needing the jitter to be small; however, we do not know how to compute it in our nonparametric setting. Even in the case of hybrid linear modeling, where the surfaces are affine, computing the scan statistic appears to be computationally intractable. At any rate, the separation required by HOSC is essentially optimal when τ is of order O(N^{−1/d}) or smaller. A quick argument for the case d = 1 and D = 2 goes as follows. Consider a line segment of length one and sample N points uniformly at random in its τ-neighborhood, with τ = O(1/N). The claim is that this neighborhood contains an empty band of thickness of order slightly less than O(1/N^2), and therefore cannot be distinguished from two parallel line segments. Indeed, a band of half-width λ inside that neighborhood is empty of sample points with probability (1 − λ/τ)^N, which converges to 1 whenever Nλ/τ → 0; since τ = O(1/N), this happens for any λ of order slightly less than 1/N^2.

In regards to the choice of parameters, the recommended choices depend solely on (d, τ, K). These model characteristics are sometimes unavailable, and we discuss their estimation in Section 2.1. Afterwards, we discuss issues such as outliers (Section 2.2) and intersections (Section 2.3).
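The empty-band probability in the optimality argument can be checked numerically; the constants below (N, the choice of λ) are illustrative instances of the asymptotic statement, not values from the paper.

```python
import numpy as np

# With tau = O(1/N), a band of half-width lam inside the tau-neighborhood
# is empty of the N sample points with probability (1 - lam/tau)^N,
# which tends to 1 whenever N * lam / tau -> 0.
N = 10_000
tau = 1.0 / N
lam = tau / (N * np.log(N))       # so that N*lam/tau = 1/log(N) -> 0

p_empty = (1 - lam / tau) ** N    # analytic value, ~ exp(-1/log N)

# quick Monte Carlo confirmation, sampling the transverse coordinate only
rng = np.random.default_rng(0)
trials = 300
y = rng.uniform(-tau, tau, size=(trials, N))
frac_empty = (np.abs(y) > lam).all(axis=1).mean()
```

For N = 10,000 the analytic probability is about exp(−1/log N) ≈ 0.9, so a central band of half-width λ of order slightly less than 1/N² is indeed typically empty.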

Parameter Estimation
In this section, we propose some methods to estimate the intrinsic dimension d of the data, the jitter τ and the number of clusters K. Though we show that these methods are consistent in our setting, further numerical experiments are needed to determine their potential in practice.
Compared with SC, HOSC requires the specification of three additional parameters. This is no small issue in practice. In theory, however, we recommend choosing d and K equal to their true values, ε and η as functions of τ, and m of order slightly less than log(N). The true unknowns are therefore (d, τ, K). We provide estimators for d and K that are consistent, and an estimator for τ that is accurate enough for our purposes. Specifically, we estimate d and τ using the correlation dimension [28] and an adaptation of our own design. The number of clusters K is estimated via the eigengap of the matrix Z.

The Intrinsic Dimension and the Jitter Level
A number of methods have been proposed to estimate the intrinsic dimensionality; we refer the reader to [33] and references therein. The correlation dimension, first introduced in [28], is perhaps the most relevant in our context, since surfaces may be close together. Define the pairwise correlation function
Cor(ε) := (N(N−1)/2)^{−1} #{(i, j) : i < j, ‖x_i − x_j‖ ≤ ε}.
The authors of [28] recommend plotting log Cor(ε) versus log ε and estimating the slope of the linear part. We use a slightly different estimator that allows us to estimate τ too, if it is not too small. The idea is to regress log Cor(ε) on log ε and identify a kink in the curve; see Figure 8 for an illustration. Though several (mostly ad hoc) methods have been proposed for finding kinks, we describe a simple method for which we can prove consistency. Let r̂ ≥ 0 be the smallest such r if one exists; otherwise, let r̂ = r_N − 2D. Define τ̂ = ρ_N^{−r̂}; also, d̂ = D if r̂ = 3, and otherwise d̂ is the closest integer to (A_3 − A_r̂)/(r̂ log ρ_N).
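The basic correlation-dimension estimate of [28] (log-log slope of the pairwise correlation function) is easy to sketch; this is the plain slope estimate, not the kink-finding procedure of Proposition 1, and the data and scale grid are illustrative.

```python
import numpy as np

# Points on a curve (here a segment) embedded in the plane: intrinsic d = 1.
rng = np.random.default_rng(0)
N = 400
t = rng.uniform(size=N)
X = np.column_stack([t, np.zeros(N)])

# all pairwise distances
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
pd = D[np.triu_indices(N, 1)]

# Cor(eps) over a grid of scales in the linear regime, then the slope of
# log Cor(eps) vs log eps estimates the intrinsic dimension
eps_grid = np.logspace(-2, -1, 10)
cor = np.array([(pd <= e).mean() for e in eps_grid])
slope, _ = np.polyfit(np.log(eps_grid), np.log(cor), 1)
d_hat = int(round(slope))
```

For points on a segment, Cor(ε) ≈ 2ε at small scales, so the fitted slope is close to 1 and rounds to the correct intrinsic dimension.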
Proposition 1. Consider the generative model described in Section 1.2 and, if there are N_0 outliers, assume that N − N_0 ≥ N/ρ_N. Then the conclusion below holds with probability at least 1 − N^{−ρ_N}.

In the context of Proposition 1, the only time that d̂ is inconsistent is when τ is of order ρ_N^{−3} or larger, in which case d̂ = D; this makes sense, since at that jitter level the τ-neighborhoods of the surfaces are effectively D-dimensional at the scales considered.

We now extend this method to deal with a smaller τ. Consider what we just did: the quantity Cor(ε) is (up to normalization) the total degree of the ε-neighborhood graph built in SC. Fixing (d, m), we now consider the total degree of the η-neighborhood graph built in HOSC, through the multiway correlation function Cor_{d,m}(ε, η), defined as the proportion of m-tuples of distinct data points with diameter at most ε and Λ_d at most η. Similarly, we regress log Cor_{d,m}(ε, η) on log η and identify a kink in the curve (Figure 9 displays such a curve). Using the multiway correlation function, we then propose an estimator τ̂ as follows. We assume that the method of Proposition 1 returned r̂ = r_N − 2D, for otherwise we know that τ̂ is accurate. Choose d = d̂ and m ≥ log(N)(log ρ_N)^2. Note that this is the only time we require m to be larger than log N.
Then, redefining τ̂ as done above, the following holds with probability at least 1 − N^{−ρ_N}. Now, τ̂ comes close to τ if τ is not much smaller than (log(N)/N)^{2/d}. Whether this is the case or not, the statement of Theorem 1 applies with τ̂ in place of τ in (13).
Though our method works in theory, it is definitely asymptotic. In practice, we recommend using other approaches for determining the location of the kink and the slope of the linear part of the pairwise correlation function (in log-log scale). Robust regression methods with high breakdown points, like least median of squares and least trimmed squares, worked well in several examples. We do not provide details here, as this is fairly standard, but the figures are quite evocative.

The Number of Clusters
HOSC depends on choosing the number of clusters K appropriately. A common approach consists in choosing K by inspecting the eigenvalues of Z. We show that, properly tuned, this method is consistent within our model. (We implicitly assume that d and τ are known, or have been estimated as described in the previous section.) The proof of Proposition 3 is parallel to that of [3, Prop. 4], this time using the estimate provided in part (A1) of the proof of Theorem 1; details are omitted. Figure 10 illustrates a situation where the number of clusters is correctly chosen by inspection of the eigenvalues, more specifically, by counting the number of eigenvalues equal to 1 in the spectrum of Z (up to numerical error). This success is due to the fact that the clusters are well-separated, and even then, the eigengap is quite small.
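The eigenvalue-counting heuristic rests on a standard fact: if the graph splits into K connected components, the normalized matrix Z = D^{−1/2} W D^{−1/2} has eigenvalue 1 with multiplicity exactly K. A minimal sketch with an idealized block-diagonal affinity:

```python
import numpy as np

# Build a block-diagonal W for a graph with 3 connected components.
blocks = [np.ones((4, 4)), np.ones((3, 3)), np.ones((5, 5))]
n_tot = sum(B.shape[0] for B in blocks)
W = np.zeros((n_tot, n_tot))
i = 0
for B in blocks:
    n = B.shape[0]
    W[i:i + n, i:i + n] = B
    i += n

# Z = D^{-1/2} W D^{-1/2}; each connected component contributes one
# eigenvalue equal to 1 (Perron eigenvalue of its block).
deg = W.sum(axis=1)
Z = W / np.sqrt(np.outer(deg, deg))
evals = np.linalg.eigvalsh(Z)
K_hat = int((evals > 1 - 1e-8).sum())   # estimated number of clusters
```

With real data the between-cluster affinities are small but nonzero, so the K top eigenvalues sit slightly below 1 and one counts eigenvalues above a threshold (equivalently, looks for the eigengap), which is exactly where the difficulty noted above arises.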
We apply this strategy to more data in Section 3, and show that it can correctly identify the parameter K in some cases (see Figure 14). In general, we do not expect this method to work well when the data have large noise or intersecting clusters, though we do not know of any other method that works in theory under our very weak separation requirements.
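As a sketch of the eigenvalue-counting rule, the following (our own minimal implementation, not the paper's code) counts the multiplicity of the eigenvalue 1 in the spectrum of the symmetrically normalized matrix Z = D^{-1/2} W D^{-1/2}; for a graph whose connected components are the clusters, this multiplicity equals the number of clusters:

```python
import numpy as np

def estimate_K(W, tol=1e-8):
    """Count eigenvalues of Z = D^{-1/2} W D^{-1/2} equal to 1 (up to tol)."""
    deg = W.sum(axis=1)
    dinv = 1.0 / np.sqrt(deg)
    Z = dinv[:, None] * W * dinv[None, :]
    evals = np.linalg.eigvalsh(Z)
    return int(np.sum(evals > 1 - tol))

# two disconnected cliques -> multiplicity of the eigenvalue 1 is 2
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
print(estimate_K(W))  # 2
```

With noise or weak separation the top eigenvalues fall strictly below 1, which is why the tolerance (and the eigengap) matters in practice.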

When Outliers are Present
So far we have only considered the case where the data are free of outliers. We now assume that some outliers may be included in the data, as described at the end of Section 1.2. As stated there, we label as an outlier any data point with low degree in the neighborhood graph, as suggested in [12,37,3]. Specifically, we compute D as in Step 2 of HOSC, and then label as outliers the points x_i with degree D_i below some threshold. Let ρ_N → ∞ slower than any power of N, e.g., ρ_N = log N. We propose two thresholds, (O1) and (O2), each identifying as outliers the points whose degree falls below the corresponding cutoff. For the task of identifying outliers, only the separation between outliers and non-outliers is relevant, so we do not require any separation between the actual clusters. We first analyze the performance of (O1), which requires about the same separation between outliers and non-outliers as HOSC requires between points from different clusters in (14).
We now analyze the performance of (O2), which requires a stronger separation between outliers and non-outliers, but operates under very weak sampling requirements.
Proposition 5. Assume that m is as in (11). In terms of separation, assume that δ_0 − τ > ε. Then, under an additional sampling condition, with probability at least 1 − N^{−ρ_N}, the procedure (O2) identifies the outliers without error.
If δ_0 = τ, so that outliers are sampled everywhere except within the τ-tubular regions of the underlying surfaces, then both (O1) and (O2) may miss some outliers lying a short distance from some B(S_k, τ); specifically, (O1) may miss outliers within ε ∧ ρ_N η of some B(S_k, τ), and (O2) within a somewhat larger distance. Using Weyl's tube formula [55], one can bound the number of such missed outliers. The sampling requirement (16) is weaker than the corresponding requirement for pairwise methods displayed in (8). In fact, (16) is only slightly stronger than what is required to merely detect the presence of a cluster hidden in noise. We briefly explain this point. Instead of clustering, consider the task of detecting the presence of a cluster hidden among a large number of outliers. Formally, we observe the data x_1, ..., x_N, and want to decide between the following two hypotheses: under the null, the points are independent, uniformly distributed in the unit hypercube (0, 1)^D; under the alternative, there is a surface S_1 ∈ S^2_d(κ) such that N_1 points are sampled from B(S_1, τ) as described in Section 1.2, while the remaining N − N_1 points are again sampled uniformly from the unit hypercube (0, 1)^D. Assuming that the parameters d and τ are known, it is shown in [5,4] that the scan statistic is able to separate the null from the alternative under the sampling condition (17). We are not aware of a method that is able to solve this detection task at a substantially lower sampling rate, and (16) comes within a logarithmic factor of (17). We thus obtain the remarkable result that accurate clustering is possible within a log factor of the best (known) sampling rate that allows for accurate detection in the same setting.
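The degree-thresholding step can be sketched as follows; this variant assumes the outlier proportion is known (as in the experiments of Section 3) instead of using the theoretical cutoffs (O1)/(O2), and all names are ours:

```python
import numpy as np

def flag_outliers_by_degree(W, outlier_fraction):
    """Label the lowest-degree points in the neighborhood graph as
    outliers, given a known outlier proportion."""
    deg = W.sum(axis=1)
    n_out = int(round(outlier_fraction * len(deg)))
    order = np.argsort(deg)                 # ascending degree
    is_outlier = np.zeros(len(deg), dtype=bool)
    is_outlier[order[:n_out]] = True
    return is_outlier

# a dense cluster plus far-away isolated points (the latter have low degree)
rng = np.random.default_rng(1)
cluster = rng.normal(0, 0.05, (40, 2))
noise = rng.uniform(-1, 1, (10, 2)) + 5.0
X = np.vstack([cluster, noise])
dist = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
W = (dist < 0.3).astype(float)
out = flag_outliers_by_degree(W, 10 / 50)
print(out[40:].all(), out[:40].any())  # True False
```

The same skeleton applies with the HOSC degrees in place of this pairwise graph; only the threshold changes between (O1), (O2) and the known-proportion variant.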

When Clusters Intersect
We now consider the setting where the underlying surfaces may intersect. The additional conditions we introduce are implicit constraints on the dimension of, and the incidence angle at, the intersections. We suppose there are an integer 0 ≤ d_int ≤ d − 1 and a finite constant C > 0 such that (18) holds (the subscript 'int' stands for 'intersection'); in addition, we assume that (19) holds for some θ_int ∈ (0, π/2]. Condition (18) is slightly stronger than requiring that S_k ∩ S_ℓ have finite d_int-dimensional volume; if the surfaces are affine, it is equivalent to the condition dim(S_k ∩ S_ℓ) ≤ d_int for all k ≠ ℓ. Condition (19), on the other hand, is a statement about the minimum angle at which any two surfaces intersect: for example, if the surfaces are affine within distance δ of their intersection, then (19) is equivalent to their maximum (principal) angle being bounded from below by θ_int. See Figure 11 for an illustration.

Fig 11: Illustration of intersecting surfaces. Though the human eye easily distinguishes the two clusters, the clustering task is much harder for machine learning algorithms.
The main issue is that there are too many data points at the intersection of the two tubular regions. However, in very special cases HOSC is able to separate intersecting clusters (see Figure 12 for such an example).
Proposition 6. In addition, assume that (18) holds, and define γ_N accordingly. Then there is a constant C > 0 such that, with probability at least 1 − C γ_N, HOSC is perfectly accurate.
The most favorable case is when τ = 0 and θ_int = π/2. Then, with our choice of ε and η in Theorem 1, and assuming ρ_N increases slowly, e.g., ρ_N ≺ log N, we have γ_N → 0 if 2d_int < d, and partial results suggest this cannot be improved substantially. This constraint on the intersection of two surfaces is rather severe. Indeed, a typical intersection between two (smooth) surfaces of the same dimension d is of dimension d − 1, and if so, only curves (d = 1, d_int = 0) satisfy the condition. Figure 12 provides a numerical example showing the algorithm successfully separating two intersecting one-dimensional clusters. Thus, even with no jitter and surfaces intersecting at a right angle, HOSC is only able to separate intersecting clusters under exceptional circumstances. Moreover, even when the conditions of Proposition 6 are fulfilled, the probability of failure is no longer exponentially small, but at best of order (1/N)^{1−2d_int/d}. That said, SC does not seem able to deal with intersections at all (see also Figure 12): it essentially corresponds to taking η = ε in HOSC, in which case γ_N never tends to zero. Though the implications of Proposition 6 are rather limited, we do not know of any other clustering method that provably separates intersecting clusters under a similar generative model. This is a first small step towards finding such a method.

Software and Numerical Experiments
We include in this section a few experiments where a preliminary implementation of HOSC outperforms SC, to demonstrate that higher-order affinities can bring a significant improvement over pairwise affinities in the context of manifold clustering.
In our implementation of HOSC, we used the heat kernel φ(s) = exp(−s²). Following the discussion in Section 1.4, at each point we restrict the computations to its ℓ nearest neighbors, which practically removes the locality parameter ε from the affinity function (3); here ℓ-NN(x_1) denotes the set of the ℓ nearest neighbors of x_1. For computational ease, we used a variant with η changed by at most a √m factor. (In the paper, the standard choice for η is a power of N, while m is of order at most log N, so this factor is indeed negligible.) In practice, we always search a subinterval of [0, 1] for the best working η (e.g., [.001, .1]), based on the smallest variance of the corresponding clusters in the eigenspace (the row space of the matrix V), as suggested in [42]. When the given data contain outliers, the optimal choice of η is instead based on the largest gap between the means of the two sets of degrees (associated with the inliers and outliers), normalized by the maximum degree. The code is available online [13].
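As an illustration of the affinity itself, the following sketch (our own simplification; the full affinity also involves proximity factors and the ℓ-NN restriction described above) computes a heat-kernel affinity from the residual of the best d-dimensional affine least-squares fit to an m-tuple of points:

```python
import numpy as np

def flatness_affinity(points, d, eta):
    """Heat-kernel affinity of an m-tuple based on the residual of its
    best d-dimensional affine (least-squares) approximation."""
    Y = points - points.mean(axis=0)
    s = np.linalg.svd(Y, compute_uv=False)
    resid = np.sqrt((s[d:] ** 2).sum())    # energy off the best d-flat
    return np.exp(-(resid / eta) ** 2)

line = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]])   # collinear triple
bent = np.array([[0.0, 0.0], [0.5, 0.4], [1.0, 0.0]])   # far from any line
print(flatness_affinity(line, 1, 0.1))       # 1.0 (zero residual)
print(flatness_affinity(bent, 1, 0.1) < 0.1)  # True
```

Collinear tuples thus get affinity near 1 while tuples that no line fits well get affinity near 0, which is exactly what lets the method link sparse but smooth structures.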

Synthetic Data
We first generate five synthetic data sets in the unit cube (0, 1)^D (D = 2 or 3), shown in Figure 13. In this experiment, the actual number of clusters K and the dimension d of the underlying manifolds are assumed known to all algorithms. For HOSC, we fix ℓ = 10 and m = d + 2, and search a subinterval of [0, 1] for η. Figure 13 exhibits the clusters found by each algorithm when applied to the five data sets. Observe that HOSC succeeded in a number of situations that are difficult for SC, e.g., when the sampling is sparse, or when the separation is small at some locations.
We also plot the leading eigenvalues of the matrix Z obtained by HOSC on each data set; see Figure 14. In data sets 1, 2 and 5, the multiplicity of the eigenvalue 1 coincides with the true number of clusters, while in 3 and 4 there is some discrepancy between the Kth eigenvalue and the number 1. Though we do not expect the eigengap method to work well in general, Figure 14 shows that it can be useful in some cases. Figure 15 displays some experiments including outliers. We sampled points uniformly at random from the unit square (0, 1)² and added them as outliers to the first three data sets of Figure 13, with percentages 33.3%, 60% and 60%, respectively. We applied SC and HOSC assuming knowledge of the proportion of outliers, labeling the points with smallest degrees as outliers. Choosing the threshold automatically remains a challenge; in particular, we did not test the theoretical thresholds here.
We observe that HOSC successfully removed most of the true outliers while preserving the smooth structures in the data; in contrast, SC tended to keep isolated high-density regions, being insensitive to sparse smooth structures. A hundred replications of this experiment (i.e., fixing the clusters and adding randomly generated outliers) show that the true positive rates (i.e., percentages of correctly identified outliers) for (SC, HOSC) are (58.1%, 67.7%), (75.4%, 86.8%) and (76.8%, 88.0%), respectively.

Real Data
We next compare SC and HOSC on the two-view motion data studied in [10,46]. This data set contains 13 motion sequences: (1) boxes, (2) carsnbus3, (3) deliveryvan, (4) desk, (5) lightbulb, (6) manycars, (7) man-in-office, (8) nrbooks3, (9) office, (10) parking-lot, (11) posters-checkerboard, (12) posters-keyboard, and (13) toys-on-table. Each sequence consists of two image frames of a 3-D dynamic scene taken by a perspective camera (see Figure 16 for a few such sequences). Suppose that several feature points have been extracted from the moving objects in the two camera views of the scene. The task is to separate the trajectories of the feature points according to the different motions. This application, which lies in the field of structure from motion, is one of the fundamental problems in computer vision. Given a physical point x ∈ R³ and its image correspondences (x_1, y_1), (x_2, y_2) ∈ R² in the two views, one can always form the joint image sample y = (x_1, y_1, x_2, y_2, 1) ∈ R⁵. It is shown in [46] that, under perspective camera projection, the joint image samples y corresponding to different motions live on different manifolds in R⁵, some of dimension 2 and others of dimension 4. Exploratory analysis of these data suggests that the manifolds in this data set mostly have dimension 2 (see Figure 17). We therefore apply our algorithm (HOSC) with d = 2 to these data sets, in comparison with pairwise spectral clustering (SC-NJW, SC-LS).
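The joint image construction above is straightforward; a minimal sketch (names are ours):

```python
import numpy as np

def joint_image_sample(view1, view2):
    """Stack the two image correspondences (x1, y1) and (x2, y2) of the
    same physical point into the joint image sample
    y = (x1, y1, x2, y2, 1) in R^5."""
    (x1, y1), (x2, y2) = view1, view2
    return np.array([x1, y1, x2, y2, 1.0])

y = joint_image_sample((0.20, 0.70), (0.25, 0.68))
print(y.shape)  # (5,)
```

Clustering is then performed on these 5-dimensional samples, one per tracked feature point.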
We use the following parameter values for the two algorithms. In HOSC, we choose ℓ = 20 and m = d + 2, and search for η in a subinterval of [0, 1] with left endpoint .0001. The original data contain some outliers: 10 of the 13 sequences are corrupted, with the largest outlier percentage being about 32%. We first manually remove the outliers from those sequences and focus solely on the clustering aspects of the two algorithms; we then add the outliers back and compare the algorithms on outlier removal. (Note that we need to provide both algorithms with the true percentage of outliers in each sequence.) By doing so, we hope to evaluate the clustering and outlier-removal aspects of an algorithm separately, and thus as accurately as possible.

Table 1: Misclassification rates and numbers of true outliers detected by HOSC, SC-NJW and SC-LS. In the clustering experiment, the outlier-free data is used; the outliers are then added back so that each algorithm can be applied to detect them.

Table 1 presents the results of the experiments above. Observe that HOSC achieved excellent clustering results on all but two sequences, with zero error on eight sequences, one mistake on sequence (13), and two mistakes on each of (1) and (4). We remark that HOSC also outperformed the algorithms in [10, Table 1] in terms of clustering accuracy, but given the main aim of this paper, we do not include those results in Table 1. In contrast, each of SC-NJW and SC-LS failed on at least five sequences (with over 15% misclassification rates), both including the two sequences that are hard for HOSC. As a specific example, we display in Figure 18 the clusters obtained by HOSC and SC on sequence (7), demonstrating again that higher-order affinities can significantly improve over pairwise affinities in the case of manifold data. Regarding outlier removal, HOSC is also consistently at least as good as SC-NJW and SC-LS. (See also Figure 17, rightmost plot.)
In this example, HOSC correctly found the two clusters using geometric information; in contrast, SC failed because it relies solely on pairwise distances.

When the Underlying Surfaces Self-Intersect
In our generative model described in Section 1.2, we assume that the surfaces are submanifolds, implying that they do not self-intersect. This is really for convenience, as there is essentially no additional difficulty arising from self-intersections. If we allow the surfaces to self-intersect, then we bound the maximum curvature (from above) rather than the reach. We could, for example, consider surfaces of the form S = f(B_d(0, 1)), where f : B_d(0, 1) → (0, 1)^D is locally bi-Lipschitz and has bounded second derivative. A similar model is considered in [38] in the context of set estimation. Proving that each cluster is connected in the neighborhood graph is the same in this case. The only issue arises where a surface comes within distance ε of another surface at a location where the latter intersects itself; the geometry involved in such a situation is indeed complex. If we postulate that no such situation arises, then our results generalize immediately to this setting.

When the Underlying Surfaces Have Boundaries
When the surfaces have boundaries, points near the boundary of a surface may be substantially connected with points on a nearby surface; see Figure 19 for an illustration. This is symptomatic of the fact that the algorithm is not able to resolve intersections in general, as discussed in Section 2.3, with the notable exception of clusters of dimension d = 1, as illustrated in the 'two moons' example of Figure 13.

Fig 19: An example of a surface with a boundary coming close to another surface. This is a potentially problematic situation for HOSC, as the points near the boundary of one surface and close to the other surface may be strongly connected to points from both clusters.

Numerically, Figure 13 shows such an example where HOSC is nevertheless successful.
If we require a stronger separation between the boundary of a surface and the other surfaces, specifically δ‡ − 2τ > ε, then no point near the boundary of a cluster is close to a point from a different cluster. (A corresponding requirement in the presence of outliers would be that outliers be separated from the boundary of a cluster by at least δ_{0,‡}, with δ_{0,‡} − τ > ε.)

When the Data is of Mixed Dimensions
In a number of situations, the surfaces may be of different intrinsic dimensions. An important instance is the study of the distribution of galaxies in space, where the galaxies are seen to cluster along filaments (d = 1) and walls (d = 2) [39]. We propose a top-down approach, running HOSC for each dimension d, starting at D − 1 and ending at 1 (or between any known upper and lower bounds on d).
At each step, the algorithm is run on each cluster obtained from the previous step, including the set of points identified as outliers. Indeed, when the dimension parameter of the algorithm is set larger than the dimension of the underlying surfaces, HOSC may not be able to properly separate clusters: for example, two parallel segments satisfying the separation requirement of Theorem 1 still belong to the same plane, and HOSC with dimension parameter d = 2 would not be able to separate the two line segments. Another reason for reprocessing the outlier bin is the greater disparity in the degrees of the data points in the neighborhood graph that is often observed with clusters of different dimensions. At each step, the number of clusters is determined automatically according to the procedure described in Section 2.1, since such information is usually not available. The parameters ε and η are chosen according to (15). Partial results suggest that, under some additional sampling conditions, this top-down procedure is accurate under weaker separation requirements than those of pairwise methods, which handle the case of mixed dimensions seamlessly [3]. The key point is that an actual cluster X_k, as defined in Section 1.2, is never cut into pieces. Indeed, properties (A1) and (A4) in the proof of Theorem 1, which guarantee the connectivity and regularity (in terms of comparable degrees) of the subgraph represented by X_k, are easily seen to remain valid when the dimension parameter of the algorithm is set larger than d. (This observation might explain the success of the SCC algorithm of [12] in some mixed settings when using an upper bound on the intrinsic dimensions.)
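The top-down pass can be sketched as follows, with `cluster_fn` standing in for one run of the base clustering at a fixed dimension (returning a list of clusters and an outlier bin); all names are ours, not from the paper:

```python
def topdown_mixed_dim(points, D, cluster_fn):
    """Run the base clustering for d = D-1, ..., 1, re-clustering each
    current group (including the outlier bin) at every step, so that
    lower-dimensional clusters are split out at later steps."""
    groups = [points]
    for d in range(D - 1, 0, -1):
        next_groups = []
        for g in groups:
            clusters, outlier_bin = cluster_fn(g, d)
            next_groups.extend(c for c in clusters if c)
            if outlier_bin:
                next_groups.append(outlier_bin)  # reprocessed at the next d
        groups = next_groups
    return groups

# toy base step standing in for one clustering run: split by parity
split = lambda g, d: ([[x for x in g if x % 2 == 0],
                       [x for x in g if x % 2 == 1]], [])
print(topdown_mixed_dim([1, 2, 3, 4], 2, split))  # [[2, 4], [1, 3]]
```

The essential property used above is that a true cluster is never cut, so later passes can only refine the partition.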

Clustering Based on Local Polynomial Approximations
For 1 ≤ d ≤ D − 1 and an integer r ≥ 3, let S^r_d(κ) be the subclass of S^2_d(κ) of d-dimensional submanifolds S such that, for every x ∈ S with tangent T_x, the orthogonal projection S ∩ B(x, 1/κ) → T_x is a C^r-diffeomorphism with all partial derivatives of order up to r bounded in sup-norm by κ. For example, S^r_d(κ) includes a subclass of surfaces of the form S = f(B_d(0, 1)), where f : B_d(0, 1) → (0, 1)^D is locally bi-Lipschitz and has its first r derivatives bounded. (We could also consider surfaces of intermediate, i.e., Hölder, smoothness, a popular model in function and set estimation [18,38].) Given that surfaces in S^r_d are well approximated locally by polynomial surfaces, it is natural to choose an affinity based on the residual of the best d-dimensional polynomial approximation of degree at most r − 1 to a set of points x_1, ..., x_m. This may be implemented via the "kernel trick" with a polynomial kernel, as done in [10] for the special case of algebraic surfaces. The main difference with the case of C² surfaces considered in the rest of the paper is the degree of approximation of a surface S ∈ S^r_d by its osculating algebraic surface of order r − 1: within a ball of radius ε, it is of order O(ε^r).
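A minimal sketch of such a polynomial residual for d = 1 (our own direct least-squares version, rather than the kernel-trick implementation of [10]): project on the top principal direction to get a local coordinate, fit the normal coordinates by a polynomial of the given degree, and return the RMS misfit.

```python
import numpy as np

def poly_residual(points, degree):
    """RMS residual of the best 1-dimensional polynomial approximation
    of the given degree to a set of points in R^D."""
    Y = points - points.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    t = Y @ Vt[0]                    # 1-D local coordinate
    resid2 = 0.0
    for k in range(1, points.shape[1]):
        z = Y @ Vt[k]                # a normal coordinate
        coef = np.polyfit(t, z, degree)
        resid2 += ((z - np.polyval(coef, t)) ** 2).sum()
    return np.sqrt(resid2 / len(points))

t = np.linspace(-1, 1, 8)
parab = np.c_[t, t ** 2]             # points on a parabola
print(poly_residual(parab, 2) < 1e-8)  # True: an exact quadratic fit
print(poly_residual(parab, 1) > 0.05)  # True: a line misses it
```

A degree-2 fit (r = 3) captures the parabola exactly while a linear fit (the C² case) leaves a residual of order ε², which is precisely the gain in approximation order discussed above.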
Partial results suggest that, under similar conditions, the kernel version of HOSC with r known may be able to operate under a separation of the form (9), with the exponent 2/d replaced by r/d, and, in the presence of outliers, within a logarithmic factor of the best known sampling rate ratio achieved by any detection method [5,4]. Regarding the estimation of τ, defining the correlation dimension via the affinity introduced here allows one to estimate τ accurately down to (essentially) (log(N)/N)^{r/d}, provided the surfaces are all in S^r_d(κ). The arguments are parallel and we omit the details.

imsart-ejs ver. 2011/11/15 file: higher-order-ejs-112811.tex date: November 30, 2011 Arias-Castro, Chen and Lerman/Higher Order Spectral Clustering
Thus, using the underlying affinity defined here may allow for higher accuracy, if the surfaces are smooth enough. However, this comes with a larger computational burden and at the expense of introducing a new parameter r, which would need to be estimated if unknown, and we do not know a good way to do that.

Other Extensions
The setting we considered in this paper, introduced in Section 1.2, was deliberately more constrained than needed for clarity of exposition. We list a few generalizations below, all straightforward extensions of our work.
• Sampling. Instead of the uniform distribution, we could use any other distribution with a density bounded away from 0 and ∞, or with fast-decaying tails such as the normal distribution.

• Kernel. The rate of decay of the kernel φ dictates the range of the affinity (3). Let ω_N be a non-decreasing sequence such that N^{3m} φ(ω_N) → 0. For a compactly supported kernel, ω_N = sup{s : φ(s) > 0}, while for the heat kernel we can take ω_N = 2√(m log N). As we take m → ∞, φ is essentially supported in [0, ω_N], so that points further than ω_N apart have essentially zero affinity. The results are identical, except that statements of the form δ − 2τ > Z are replaced with δ − 2τ > ω_N Z.

• Measure of flatness. As pointed out in the introduction, any reasonable measure of linear approximation could be used instead. Our choice was driven by convenience and simplicity.
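For the kernel bullet, one can check numerically that the choice ω_N = 2√(m log N) for the heat kernel φ(s) = exp(−s²) gives N^{3m} φ(ω_N) = N^{−m} → 0; a sketch (function name is ours):

```python
import math

def tail_product(N, m):
    """N^(3m) * phi(omega_N) for the heat kernel phi(s) = exp(-s^2)
    with omega_N = 2*sqrt(m*log N); this equals N^(-m), hence -> 0."""
    omega = 2.0 * math.sqrt(m * math.log(N))
    log_val = 3 * m * math.log(N) - omega ** 2   # work in logs: no overflow
    return math.exp(log_val)

print(tail_product(10**3, 5))  # ~ 1e-15, i.e. N**(-m)
```

Working in logs avoids overflow of N^{3m} for even modest N and m.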

Appendix A: Preliminaries
We gather here some preliminary results. Recall that, for a, b ∈ R, a ∨ b := max(a, b), a ∧ b := min(a, b), and a₊ := a ∨ 0.

A.1. Large Deviations Bounds
The following result, used repeatedly below as Lemma 1, is a simple consequence of Hoeffding's or Bernstein's inequalities.

A.2. Some Geometrical Results
We start by quantifying how well a surface S ∈ S^2_d(κ) is locally approximated by its tangent. Recall that, for an affine subspace T, P_T denotes the orthogonal projection onto T. For any s ∈ S, let T_s denote the tangent of S at s.

Lemma 2. For any S ∈ S^2_d(κ) and s ∈ S, the orthogonal projection onto T_s is injective on B(s, 1/(4κ)) ∩ S, and P_{T_s}^{-1} has Lipschitz constant bounded by √2.

Proof. This sort of result is standard in differential geometry; we follow the exposition in [43]. We note that the manifold parameter τ in [43], i.e., the inverse of the condition number, coincides with the manifold's reach. We thus fix an S ∈ S^2_d(κ) and denote τ := reach(S). Since 1/κ is a lower bound on the reach for manifolds in S^2_d(κ), we have τ ≥ 1/κ. Fix also a point s ∈ S. Applying [43, Lem. 5.4], we obtain that P_{T_s} is one-to-one on B(s, ε) ∩ S for any ε < τ/2, in particular for ε < 1/(2κ). We obtain an estimate on the image of P_{T_s} as follows. We note that [43, proof of Lem. 5.3] implies that P_{T_s}(B(s, ε) ∩ S) ⊇ B(s, ε cos arcsin(ε/(2τ))) ∩ T_s. (24) Furthermore, if ε ≤ 1/(4κ), cos arcsin(ε/(2τ)) ≥ cos arcsin(κε/2) ≥ 63/64 > 1/2. (25)

Combining (24) and (25) yields (26); in particular, for ε = 1/(4κ), the range of P_{T_s} (applied to B(s, 1/(4κ)) ∩ S) contains the ball B(s, 1/(8κ)) ∩ T_s. Next, for any s' ∈ B(s, 1/(4κ)) ∩ S, the derivative of the linear operator P_{T_s} at s' in the direction u, a unit vector in T_{s'}, is bounded below as in (27), where θ_1 denotes the largest principal angle between the corresponding subspaces. To further bound the RHS of (27) from below, we couple [43, Props. 6.2, 6.3] and use τ ≥ 1/κ to obtain (28). Combining (27) and (28), we conclude that P_{T_s}^{-1} has Lipschitz constant bounded by √2 on B(s, 1/(4κ)) ∩ T_s. For the inclusions, we use the fact (29), which appears in [20, Th. 4.18(2)]. This immediately implies the first inclusion, which actually holds for any ε > 0 and with κ replaced by κ/2. The second inclusion follows by combining (26) with (29).
Next, we estimate the volume of the intersection of the neighborhood of a surface and a ball centered at a point within that neighborhood.
The following result concerns the approximation of a set of points in the neighborhood of a d-dimensional affine subspace by the affine subspace generated by a subset of d + 1 of the points.

Lemma 4.
There is a constant C > 0 depending only on d such that, if z 1 , . . . , z m ∈ B(L, η), with L ∈ A d and m ≥ d + 2, then there exists H ∈ A d generated by d + 1 points among z 1 , . . . , z m , such that z 1 , . . . , z m ∈ B(H, Cη).
Proof. For points a_1, ..., a_k, let aspan{a_1, ..., a_k} denote the affine subspace of minimum dimension passing through a_1, ..., a_k. Let (i_1, i_2) ∈ argmax_{i,j} ‖z_i − z_j‖ and, for 3 ≤ k ≤ d + 1, choose i_k greedily, maximizing the distance to the affine span of the points already chosen. Without loss of generality, assume that z_{i_1} is the origin, which allows us to identify a point z with the vector z − z_{i_1}. Take z ∈ {z_1, ..., z_m} and express it as z = a_1 v_1 + ··· + a_d v_d + w, with w ⊥ A_d. We show that ‖w‖ ≤ Cη for a constant C depending only on d, which implies that z ∈ B(A_d, Cη). Let C_1 > 0, to be made sufficiently large later. By construction, for C_1 large enough, q_1, ..., q_d are linearly independent, and therefore span L. Suppose this is the case, and define the matrices V with columns v_1, ..., v_d and Q with columns q_1, ..., q_d. Then, by continuity, for C_1 large enough, V and Q are close in the (Euclidean) operator norm, so that ‖P_L w‖ ≤ ‖P_{L⊥} w‖. Now, use the triangle inequality in (30). Because z ∈ B(L, η), we have ‖P_{L⊥} z‖ ≤ η. For the other terms, we have ‖P_{L⊥} v_k‖ ≤ η/λ_k as before and, using the fact that, by construction, v_1, ..., v_d are orthonormal with A_k = span{v_1, ..., v_k} and ‖P_{A⊥_{k−1}} z‖ ≤ λ_k, together with the Cauchy-Schwarz inequality, we bound the remaining terms. Hence, the RHS in (30) is bounded by (d + 1)η, implying ‖w‖ ≤ (2d + 2)η. We then let C = max(C_1, 2d + 2).
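The greedy construction in this proof can be checked numerically; the following sketch (all names ours) picks d + 1 well-spread points from a sample lying within η of a 2-plane in R³ and verifies that every sample point lies within a constant multiple of η of their affine span:

```python
import numpy as np

def greedy_affine_reps(Z, d):
    """Pick d+1 indices as in the proof of Lemma 4: the two farthest
    points first, then repeatedly the point farthest from the current
    affine span."""
    dists = np.linalg.norm(Z[:, None] - Z[None], axis=-1)
    i, j = np.unravel_index(dists.argmax(), dists.shape)
    reps = [int(i), int(j)]
    while len(reps) < d + 1:
        base = Z[reps[0]]
        Q, _ = np.linalg.qr((Z[reps[1:]] - base).T)  # basis of current span
        R = Z - base
        off = np.linalg.norm(R - (R @ Q) @ Q.T, axis=1)
        reps.append(int(off.argmax()))
    return reps

rng = np.random.default_rng(2)
eta = 0.01
Z = np.c_[rng.uniform(size=(50, 2)), eta * rng.uniform(-1, 1, 50)]
reps = greedy_affine_reps(Z, 2)
base = Z[reps[0]]
Q, _ = np.linalg.qr((Z[reps[1:]] - base).T)
R = Z - base
off = np.linalg.norm(R - (R @ Q) @ Q.T, axis=1)
print(off.max() < 20 * eta)  # all points within C*eta of the spanned plane
```

The constant here (20) is for illustration only; the lemma only asserts some C depending on d.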
Below we provide an upper bound on the volume of the three-way intersection of the neighborhood of a surface, a ball centered at a point on the surface, and the neighborhood of a d-dimensional affine subspace passing through that point, in terms of the angle between this subspace and the tangent to the surface at that same point. The principal angles θ_1 ≥ ··· ≥ θ_d between linear subspaces L, L' ∈ A_d are defined in the usual recursive way; note that the orthogonality constraints are void when r = 1. (Some authors use the reverse ordering, e.g., [26].)

Lemma 5. Consider a surface S ∈ S^2_d(κ). Suppose ε ≥ η ∨ τ, η ≥ 2κε² and τ > 0. Let Ψ be the uniform distribution on B(S, τ). For s ∈ S, let T_s be the tangent space to S at s. Then, for L ∈ A_d containing s, the mass Ψ(B(s, ε) ∩ B(L, η)) is bounded above in terms of the largest principal angle θ_1(L, T_s), decreasing as the angle increases.

Proof.
We divide the proof into two cases; though the arguments are similar, the first case is simpler and allows us to introduce the main ideas before generalizing to the second case.

Case 2ε ≤ τ. We use Lemma 2 and the fact that τ ≥ 2ε to bound the intersection. Ignoring the constant factor 1 + κε, we may assume without loss of generality that s is the origin, from which we obtain the desired bound.

A companion to the previous result, the following lemma provides a lower bound on the angle between the affine subspace and the tangent.

Lemma 6. Let ε, η > 0 and take S ∈ S^2_d(κ). Suppose L ∈ A_d is such that B(L, η) contains s ∈ S and y ∈ B(s, ε). Let T_s be the tangent to S at s. Then θ_1(L, T_s) is bounded below as stated.

Proof. Let T denote T_s for short, and let L' be the line passing through s and P_L(y). Since L' ⊂ L, we have θ_1(L, T) ≥ θ_1(L', T), and using the triangle inequality and the fact that θ ≥ sin θ for θ ≥ 0, the latter is bounded below as claimed.
Next is another result estimating the volumes of certain intersections. It is similar to Lemma 5, though the conditions differ.
Lemma 7. Consider a surface S ∈ S^2_d(κ), and let Ψ be the uniform distribution on B(S, τ). Then, for ε ≥ η and τ > 0, the stated two-sided volume bound holds, where the supremum is over y ∈ R^D and L ∈ A_d, and the implicit constants depend only on κ and d. Also, for ε ≥ 10η, η ≥ 10κε² and τ > 0, and any x ∈ B(S, τ), the corresponding supremum bound holds.

Proof. The proof is similar to that of Lemma 5. We divide it into two parts; in both cases, we conclude with Lemma 3.

Lower bound. Let s be the point on S closest to x, with tangent subspace T. When η ≥ 2τ + 4κε², take as L the translate of T passing through x and use Lemma 2 to get B(S, τ) ∩ B(x, ε) ⊂ B(T, τ + κ(τ + ε)²) ⊂ B(L, η), after which we use Lemma 3. Now suppose η ≤ 2τ + 4κε², and notice that, since η ≥ 10κε², we have τ ≥ 3κε². First assume that ε ≥ 10τ. We use Lemma 2 again and, without loss of generality, assume that x is the origin and L = span{e_1, ..., e_d}.
Since the volume is least when ‖x − s‖ = τ, assume that s = τ e_{d+1} (seen as a point in space). Define ν = (η + 2κε²)/2 and note that ν ≤ η ∧ (2τ) by the conditions on η and τ; this gives (32). By the conditions imposed on ε, η and τ, the RHS in (32) contains a set of the appropriate volume, whence the result. Finally, assume that τ ≥ ε/10 and take L passing through an appropriate point; we then conclude with Lemma 3.

Proof. The proof is parallel to (and simpler than) that of Lemma 7; we omit the details.

A.3. A Perturbation Bound
In the proof of Theorem 1, we follow the strategy outlined in [42], based on verifying the following conditions (where (A4) has been simplified). Let I_k = {i : x_i ∈ X_k} and let W̃_k denote the matrix of within-cluster affinity coefficients, indexed by i, j ∈ I_k. Let W̃_ij = 0 if i ∈ I_k and j ∈ I_ℓ with k ≠ ℓ. These are the coefficients of W and D under infinite separation, i.e., assuming δ = ∞. (In fact, δ > ε + 2τ is enough, since we use the simple kernel.) (A1) For all k, the second largest eigenvalue of W̃_k is bounded above by 1 − γ.
(A2) For all k ≠ ℓ, the corresponding cross-cluster bound holds. (A3) For all k and all i ∈ I_k, the corresponding degree bound holds. (A4) For all k and all i, j ∈ I_k, D̃_i ≤ Q D̃_j.
The following result is a slightly modified version of [42,Th. 2], stated and proved in [3,Th. 7]. See also [11,Th. 4.5]. Recall the matrix V defined in Algorithm 1.

Appendix B: Main Proofs
For a set A, its cardinality is denoted by #A. Throughout the paper, C denotes a generic constant that does not depend on the sample size N and satisfies C ≥ 1.

B.1. Proof of Theorem 1
Given Theorem 2, we turn to proving that the four conditions (A1)-(A4) hold with probability tending to one, with ν_1 = ν_2² = (ρ_N/ζ)^{−m/2}, γ > C^{−m} N^{−2} and Q ≤ C^m for some constant C > 0. Since m log(ρ_N/ζ) ≍ log N, this implies that Theorem 2 applies; therefore, since the r_k's are themselves orthonormal, the K-means algorithm with near-orthogonal initialization outputs the perfect clustering. We restrict ourselves to the case where τ ≤ (ρ²_N log(N)/N)^{1/d}, for otherwise η ≥ ε and HOSC is essentially SC, studied in [3]. With that bound on τ, (12) reduces to ε ≥ (ρ²_N log(N)/N)^{1/d}. By the same token, we assume that η ≤ ε, so that ε ≥ η ≥ τ + ρ_N ε². To verify conditions (A2), (A3) and (A4), we need to estimate the degree of each vertex under infinite separation and the edge weights under finite separation. We start with the case of infinite separation.

Proposition 7. With probability at least 1 − N^{−ρ²_N/(Kζ)}, the stated estimates on W̃_ij and D̃_i hold uniformly over i, j ∈ I_k and k = 1, ..., K.
Proof. Within a cluster, the linear approximation factor in (3) is a function of the proximity factor; this is due to Lemma 2. Formally, let G̃_{i,ε} denote the degree of x_i in the ε-neighborhood graph built by SC.
Then Proposition 7 is a direct consequence of Lemma 9, which relates G̃_{i,ε} to W̃_ij and D̃_i, and of Proposition 8, which estimates G̃_{i,ε}.
Lemma 9. The edge weights W̃_ij and degrees D̃_i are related to G̃_{i,ε} as stated. Note that r^{(m)} ≤ r^m, and r^{(m)} ≥ (r/3)^m for r ≥ m.
Proof. We focus on the first expression, as the second is obtained by summing the first over j ∈ I_k, j ≠ i, where k is such that i ∈ I_k. Therefore, fix i, j ∈ I_k. The upper bound on W̃_ij is straightforward. The lower bound relies on the fact that B(S_k, τ) ∩ B(x_i, ε/2) ⊂ B(T_{s_i}, κ(ε/2 + 2τ)² + τ), where s_i is the point on S_k closest to x_i. Indeed, take x ∈ B(S_k, τ) ∩ B(x_i, ε/2) and let s ∈ S_k be such that ‖x − s‖ ≤ τ. By the triangle inequality, ‖s − s_i‖ ≤ ε/2 + 2τ, so that, by Lemma 2, s ∈ B(T_{s_i}, κ(ε/2 + 2τ)²), and therefore x ∈ B(T_{s_i}, κ(ε/2 + 2τ)² + τ). We then conclude using the facts that τ ≤ ε and η ≥ τ + ρ_N ε², with ρ_N → ∞.
Note that N ≤ Kζ N_k, which together with (12) implies a lower bound on N_k. The following bound on G̃_{i,ε} is slightly more general than needed at this point.
Proof. This is done in the proof of [3, Eq. (A4)], and we repeat the arguments here for future reference. Let Ψ_k denote the uniform distribution on B(S_k, τ), so that, for any (measurable) set A, Ψ_k(A) is the normalized volume of A ∩ B(S_k, τ). Since G̃_{i,ε} is a sum of independent Bernoulli random variables, by Lemma 1 it suffices to bound it in expectation, which we do using Lemma 3. Applying Lemma 1 and (35), we then obtain the stated bound. We then apply the union bound and use the fact that N · N^{−2ρ²_N/(Kζ)} ≤ N^{−ρ²_N/(Kζ)}, since ρ²_N → ∞. We now turn to bounding the edge weights W_ij under finite separation. We do so by comparing them with the edge weights under infinite separation.
Proof. If $k = \ell$, then $W_{ij} - \widetilde W_{ij}$ is the sum of $\alpha_d(x_i, x_j, x_{i_1}, \dots, x_{i_{m-2}})$ over (distinct) $i_1, \dots, i_{m-2}$ that are not all in $I_k$. When $k \ne \ell$, $\widetilde W_{ij} = 0$ and $W_{ij}$ is again the same sum, except this time over all (distinct) $i_1, \dots, i_{m-2}$. Both situations are similar and we focus on the latter. We assume that $\|x_i - x_j\| \le \varepsilon$, for otherwise the bound is trivially satisfied. Define $G_{i,\varepsilon}$, the degree of $x_i$ in the $\varepsilon$-neighborhood graph on the whole sample, which is the equivalent of $\widetilde G_{i,\varepsilon}$ under finite separation, as well as
$$H^*_{i,j,\varepsilon,\eta} = \max_M H_{i,\varepsilon,\eta}(L_M),$$
where the maximum is over all $M \subset \{1, \dots, N\}$ of size $|M| = d+1$ such that $x_j \in B(L_M, \eta)$. Then Proposition 9 is a direct consequence of Lemma 10, which relates $G_{i,\varepsilon}$ and $H^*_{i,j,\varepsilon,\eta}$ to $W_{ij}$, and of Propositions 10 and 11, which bound $G_{i,\varepsilon}$ and $H^*_{i,j,\varepsilon,\eta}$, respectively.

Lemma 10. There is a constant $C > 0$ such that $W_{ij}$ obeys the stated bound.

Proof. By definition of the affinity (3) and the triangle inequality, we have the stated bound uniformly over $i = 1, \dots, N$.
Proof. We start from the expectation of $G_{i,\varepsilon}$. By Lemma 3, for every $\ell$ such that $\operatorname{dist}(x_i, S_\ell) \le \varepsilon + \tau$, we can bound $\Psi_\ell(B(x_i, \varepsilon))$; summing over $\ell$ bounds $\mathbb{E}\, G_{i,\varepsilon}$. We then use Lemma 1 and (35).
Proof. For $L \in A_d$, $H_{i,\varepsilon,\eta}(L)$ is a sum of independent Bernoulli random variables, with expectation $\sum_\ell N_\ell\, \Psi_\ell\big(B(x_i, \varepsilon) \cap B(L, \eta)\big)$.
Take $\ell$ such that $B(S_\ell, \tau) \cap B(x_i, \varepsilon) \cap B(L, \eta) \ne \emptyset$, let $x$ be a point in that set and let $s$ be the point on $S_\ell$ closest to $x$. Then, by the triangle inequality and the fact that $\varepsilon \ge \eta \ge \tau$, the set $B(x_i, \varepsilon) \cap B(L, \eta)$ is contained in $B(s, 3\varepsilon) \cap B(L_s, 3\eta)$, where $L_s$ is the translate of $L$ passing through $s$. Therefore, $\Psi_\ell(B(x_i, \varepsilon) \cap B(L, \eta)) \le \Psi_\ell(B(s, 3\varepsilon) \cap B(L_s, 3\eta))$. Our focus is on $L$ such that $x_i, x_j \in B(L, \eta)$, which transfers to $x_i, x_j \in B(L_s, 3\eta)$ by the triangle inequality. Since $x_i$ and $x_j$ belong to different clusters, for a given $\ell$, at least one of them does not belong to $X_\ell$. Hence, by Lemma 6 and the fact that $\delta \ge \eta \ge \tau + \kappa\varepsilon^2$, we have $\theta_1(L, T_s) \gtrsim \delta/\varepsilon$, uniformly over $s \in S_\ell$ and $\ell$. (Remember that $\theta_1(L, T)$ denotes the largest principal angle between $L$ and $T$.) Together with Lemma 5, we thus get $\Psi_\ell(B(s, 3\varepsilon) \cap B(L_s, 3\eta)) \le C \varepsilon^d (\eta/\delta)$.
Hence, by the fact that $\delta \ge \rho_N \eta$, the expectation of $H_{i,\varepsilon,\eta}(L)$ is small compared with that of $G_{i,\varepsilon}$. With Lemma 1 and (12), we then get a corresponding bound with high probability. Hence, by the union bound over the (polynomially many) subsets $M$, the same bound holds uniformly, and the right-hand side is eventually bounded by $N^{-\rho_N}$.
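The quantity $\theta_1(L, T)$, the largest principal angle between two $d$-dimensional subspaces, can be computed from the singular values of the product of orthonormal bases, whose entries are the cosines of the principal angles; a minimal sketch:

```python
import numpy as np

def largest_principal_angle(A, B):
    """Largest principal angle theta_1 between the column spans of A and B.

    The singular values of Q_A^T Q_B (orthonormal bases) are the cosines of
    the principal angles, so theta_1 corresponds to the smallest one.
    """
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s.min(), -1.0, 1.0))

# Two planes in R^3 meeting at angle pi/4 along the x-axis:
T = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])                     # span{e1, e2}
L = np.array([[1.0, 0.0], [0.0, np.cos(np.pi/4)], [0.0, np.sin(np.pi/4)]])
print(largest_principal_angle(L, T))  # pi/4, up to rounding
```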
• Verifying (A1): As suggested in [42], we approach this through a lower bound on the Cheeger constant. Let $\widetilde Z_k$ be the matrix obtained from $\widetilde W_k$ following SC. That $\widetilde Z_k$ has eigenvalue 1 with multiplicity 1 results from the graph being fully connected [14]. The Cheeger constant of $\widetilde W_k$ is defined as
$$h_k = \min_{I} \frac{\sum_{i \in I,\, j \in I_k \setminus I} \widetilde W_{ij}}{\sum_{i \in I} \widetilde D_i},$$
where the minimum is over all subsets $I \subset I_k$ of size $|I| \le N_k/2$. The spectral gap of $\widetilde Z_k$ is then at least $h_k^2/2$. By (33)-(34), there is a constant $C > 0$ such that the ratio above is bounded below accordingly. From here, the proof is identical to that of [3, Eq. (A1)], which bounds the minimum from below by $1/N_k$, so that $h_k \ge C^{-m} N_k^{-1}$.
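Cheeger's inequality, which supplies the spectral-gap bound $h_k^2/2$ used here, can be checked by brute force on a small fully connected weighted graph. This sketch uses the standard volume-weighted form of the Cheeger constant, with arbitrary illustrative weights rather than the paper's affinities:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Small fully connected weighted graph with symmetric weights in (0, 1].
n = 8
W = rng.uniform(0.1, 1.0, size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)

deg = W.sum(axis=1)
vol = deg.sum()

# Cheeger constant: minimize cut(S) / vol(S) over S with vol(S) <= vol / 2.
h = min(
    W[np.ix_(S, [j for j in range(n) if j not in S])].sum() / deg[list(S)].sum()
    for r in range(1, n)
    for S in itertools.combinations(range(n), r)
    if deg[list(S)].sum() <= vol / 2
)

# Spectral gap of the symmetrically normalized weight matrix (same spectrum
# as the random-walk matrix D^{-1} W); its top eigenvalue is 1.
Z = W / np.sqrt(np.outer(deg, deg))
eig = np.sort(np.linalg.eigvalsh(Z))[::-1]
gap = eig[0] - eig[1]

print(gap >= h**2 / 2)  # True: Cheeger's inequality guarantees gap >= h^2 / 2
```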

B.2. Proof of Proposition 6
From the proof of Theorem 1, it suffices to verify that (A2) and (A3) still hold under the conditions of Proposition 6, and in view of (19), we may focus on $W_{ij}$ for $i \in I_k$ and $j \in I_\ell$, with $k \ne \ell$, such that $\|x_i - x_j\| \le \varepsilon$ and with $x_j$ close to an intersection; specifically, for some $p \ne \ell$, $\operatorname{dist}(x_j, S_\ell \cap S_p) \le \nu$, where $\nu := (\sin \theta_{\mathrm{int}})^{-1} (\varepsilon \wedge \rho_N \eta)$.
In fact, we show that, under the conditions of Proposition 6, with probability at least $1 - \gamma_N$, there is no such pair of points $(x_i, x_j)$. For fixed $(k, \ell, p)$, the probability that $x_i \sim \Psi_k$ and $x_j \sim \Psi_\ell$ satisfy these conditions is bounded after integrating over $x_i$, using Lemma 3, with an implicit constant depending only on $\kappa$ and $d$. Moreover, this quantity is controlled by condition (18). Therefore, using the union bound, the probability that there is such a pair of points is of order not exceeding the sum of these bounds over the triplets $(k, \ell, p)$.

B.3. Proof of Propositions 4 and 5
Without loss of generality, we assume that $\delta_0$ is small and that $\eta \le \varepsilon/10$. In view of how the procedures (O1) and (O2) work, we need to bound the degrees of non-outliers from below and the degrees of outliers from above. The following lower bound holds:
$$N_k\, \varepsilon^d\, (1 \wedge (\eta/\tau))^{D-d} \ge (\rho_N/(K\zeta)) \log N, \quad \forall k = 1, \dots, K.$$
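To illustrate the role of these degree bounds, here is a small sketch of degree-based outlier detection in the spirit of (O1): points sampled near a one-dimensional curve have neighborhood-graph degrees well above those of uniform outliers, so thresholding the degree separates most of the two groups. The radius and threshold below are hypothetical choices for illustration, not the paper's tuning.

```python
import numpy as np

rng = np.random.default_rng(2)

# Points near a circle (the "cluster") plus uniform outliers in a square.
t = rng.uniform(0, 2 * np.pi, 500)
cluster = np.c_[np.cos(t), np.sin(t)] + 0.01 * rng.normal(size=(500, 2))
outliers = rng.uniform(-1.5, 1.5, size=(50, 2))
X = np.vstack([cluster, outliers])

# Degree in the epsilon-neighborhood graph.
eps = 0.15
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
degree = (D2 <= eps**2).sum(axis=1) - 1    # exclude the point itself

# Cluster points sit on a 1-dimensional curve, so their degrees concentrate
# well above those of isolated uniform outliers; this threshold is a
# hypothetical choice, not the paper's formula.
threshold = 0.5 * np.median(degree)
flagged = degree < threshold
print(flagged[500:].mean())   # fraction of true outliers flagged (most of them)
```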
We prove a result that is more general than what we need now.
For the upper bound, by Lemma 3 and a simple volume bound, we have the corresponding estimate, with $N - N_0 \le (K\zeta) N_k$ for any $k \ne 0$. As in Proposition 8, we then use Lemma 1 together with (46) and the union bound to conclude the proof of (51). The proof of (52) is identical, except that, when $\delta_0 > \tau + \varepsilon$, we have $\Psi_\ell(B(x_i, \varepsilon)) = 0$ for $\ell \ne 0$ and $i \in I_0$.
Proposition 14. If (46) holds, then with probability at least $1 - N^{-\rho_N/(K\zeta)}$, the bound (53) holds uniformly over $i \in I_k$ and $k \ne 0$, and the companion bound holds uniformly over $i \in I_0$.

Proof. First assume that $i \in I_k$ with $k \ne 0$. For the lower bound in (53), let $L$ be a subspace achieving the lower bound in Lemma 7. We have $H_{i,\varepsilon,\eta} \ge H_{i,\varepsilon,\eta}(L)$, and the term on the right-hand side is a sum of independent Bernoulli random variables with expectation
$$\mathbb{E}\big(H_{i,\varepsilon,\eta}(L)\big) = N_k\, \Psi_k\big(B(x_i, \varepsilon) \cap B(L, \eta)\big) \gtrsim N_k\, \varepsilon^d\, (1 \wedge (\eta/\tau))^{D-d}.$$
We then apply Lemma 1, using (46), and the union bound. For the upper bound in (53), the arguments are the same as in the proof of (43), except for the following bound in expectation, valid for any $L \in A_d$:
$$\mathbb{E}\big(H_{i,\varepsilon,\eta}(L)\big) = \sum_\ell N_\ell\, \Psi_\ell\big(B(x_i, \varepsilon) \cap B(L, \eta)\big),$$
which is controlled by Lemmas 7 and 8. Now, assume that $i \in I_0$. Again, the arguments are the same as in the proof of (43), except that the bounds in expectation are different. Specifically, if $\delta_0 > \varepsilon + \tau$, then $\Psi_\ell(B(x_i, \varepsilon) \cap B(L, \eta)) = 0$ for all $\ell \ne 0$, so that, by Lemma 8, for any $L \in A_d$, $\mathbb{E}(H_{i,\varepsilon,\eta}(L)) \prec N_0\, \varepsilon^d\, \eta^{D-d}$. Otherwise, the expectation is bounded similarly, now including the contribution of the clusters within distance $\varepsilon + \tau$ of $x_i$.

We are now in a position to prove Propositions 4 and 5. We first consider (O1), using (48) and (49) and the assumed relations between $\tau$, $\eta$ and $\varepsilon$. On the one hand, by (48), $D_i^{1/(m-1)} \gtrsim N_k \varepsilon^d$, uniformly over $i \in I_k$, for all $k \ne 0$. Hence, since $\rho_N \to \infty$, no non-outlier is identified as an outlier. On the other hand, by (49), the degrees of the points in $I_0$ are of strictly smaller order. Hence, all outliers are identified as such.
We now consider (O2). On the one hand, by (48) and (16), and the expressions for $\varepsilon$ and $\eta$, we have the required lower bound uniformly over $k \ne 0$ and $i \in I_k$. Hence, no non-outlier is identified as an outlier. On the other hand, by (49), for any $i \in I_0$, the degree falls below the threshold, which comes from $m \asymp \log(N)/\log(\rho_N)$. Hence, all outliers are identified as such.
From the first part, we see that $\hat r \ge r_0 \wedge (r_N - 2D/d)$, since $d \le D - 1$ and $\rho_N \to \infty$. To use the second part, note that $r_0 + 2 \le r_N^*$ if, and only if, $r_0 \le r_N - 2D/d$. If this is the case, $\hat r \le r_0 + 1$. From this follows the statement in Proposition 1.

C.2. Proof of Proposition 2
We follow the proof of Proposition 1. We assume that $\hat d = d$, which happens with probability tending to one. Let $\eta_s = \rho^{-r-s}$. From here the arguments are parallel to those used in the proof of Proposition 1.