NFFT meets Krylov methods: Fast matrix-vector products for the graph Laplacian of fully connected networks

The graph Laplacian is a standard tool in data science, machine learning, and image processing. The corresponding matrix inherits the complex structure of the underlying network and is in certain applications densely populated. This makes computations, in particular matrix-vector products, with the graph Laplacian a hard task. A typical application is the computation of a number of its eigenvalues and eigenvectors. Standard methods become infeasible when the number of nodes in the graph is large. We propose the use of the fast summation based on the nonequispaced fast Fourier transform (NFFT) to perform dense matrix-vector products with the graph Laplacian quickly without ever forming the whole matrix. The enormous flexibility of the NFFT algorithm allows us to embed the accelerated multiplication into Lanczos-based eigenvalue routines or iterative linear system solvers and even to consider kernels other than the standard Gaussian. We illustrate the feasibility of our approach on a number of test problems from image segmentation to semi-supervised learning based on graph-based PDEs. In particular, we compare our approach with the Nystr\"om method. Moreover, we present and test an enhanced, hybrid version of the Nystr\"om method, which internally uses the NFFT.


Introduction.
Graphs are a fundamental tool in the modeling of imaging and data science applications [44,37,2,3,15]. To apply graph-based techniques, individual data points in a data set or pixels of an image represent the vertex set or nodes V of the graph, and the edges indicate the relationship between the vertices. In a number of real-world examples, the graph is sparse in the sense that each vertex is only connected to a small number of other vertices, i.e., the graph affinity matrix is sparsely populated. In other applications, such as the mentioned data points or image pixels, the natural choice for the graph would be a fully connected graph, which is then reflected in dense matrices that represent the graph information. If no graph structure is given a priori, the fully connected graph is the natural choice. As the eigenvectors of the corresponding graph Laplacian are crucial in reducing the complexity of the underlying problem or for the extraction of quantities of interest [4,5,37], it is important to compute them accurately and fast. If this matrix is sparse, numerical analysis has provided efficient tools based on the Lanczos process with sparse matrix-vector products that can compute the eigeninformation efficiently. For complex interactions leading to dense matrices, these methods suffer from the high cost of the matrix-vector product.
Our goal is hence to obtain the eigeninformation despite the fact that the graph is fully connected and without any a priori reduction of the graph information. For this we rely on a Lanczos procedure based on [1]. This method needs to perform the matrix-vector product in a fast way, and thus evaluate all graph information, without ever fully assembling the graph matrices. In a similar fashion the authors in [5] utilize the well-known Nyström method to only work with partial information from the graph and only approximately represent the remaining parts. Such methods are well known within the fast solution of integral equations and have found applicability within the data science community [9,22]. The technique we present here is known as a fast summation method [31,32] and is based on the nonequispaced fast Fourier transform (NFFT), see [18] and the references therein. We apply this method in the setting where the weights of the edges between the vertices are modelled by a Gaussian kernel function of medium to large scaling parameter, such that the Gaussian is not well localized and most vertices interact with each other. For the case of a smaller scaling parameter and consequently a more localized Gaussian, we refer to [26], which is partially based on a technique presented in [46] for Gaussian kernels. Moreover, we remark that the NFFT-based fast summation method considered in this paper does not only support Gaussians but can handle various other rotationally invariant functions.
The remaining parts of this paper are structured as follows. In Section 2, we first introduce the graph Laplacian and discuss the matrix structure. In Section 3, we introduce the NFFT-based fast summation, which allows for computing fast matrix-vector products with the graph Laplacian. In Section 4, we then recall Krylov subspace methods and in particular the Lanczos method, which lies at the heart of the numerical computations to obtain a small number of eigenvectors. We then show that the graph Laplacian provides the ideal environment to be used together with the NFFT-based fast summation, and we obtain the NFFT-based Lanczos method. In Section 5 we briefly discuss the Nyström method as a direct competitor to our approach. We improve and accelerate this method, creating a new hybrid Nyström-Gaussian-NFFT version, which incorporates the NFFT-based fast summation. In Section 6, we compare the NFFT-based Lanczos method, the Nyström method, and the hybrid Nyström-Gaussian-NFFT method with the direct application of the Lanczos method for a dense, large-scale problem. Additionally, we illustrate on a number of exemplary applications, such as spectral clustering and semi-supervised learning, that our approach provides a convenient infrastructure to be used within many different schemes.
2. The graph Laplacian and fully connected graphs. We consider an undirected graph G = (V, E) with the vertex set V = {v_j}_{j=1}^n and the edge set E, cf. [8] for more information. An edge e ∈ E is a pair of nodes (v_j, v_i) with v_j ≠ v_i and v_j, v_i ∈ V. For weighted undirected graphs, such as the ones considered in this paper, we also have a weight function w : V × V → R with w(v_j, v_i) = w(v_i, v_j) for all j, i. We assume further that the function is positive for existing edges and zero otherwise. The degree of the vertex v_j ∈ V is defined as

d(v_j) = Σ_{i=1}^n w(v_j, v_i).

Let W, D ∈ R^{n×n} be the weight matrix and the diagonal degree matrix with entries W_ji = w(v_j, v_i) and D_jj = d(v_j). Since we do not permit graphs with loops, W is zero on the diagonal. Now the crucial tool for further investigations is the graph Laplacian L defined via

L = D − W.

The matrix L is typically known as the combinatorial graph Laplacian and we refer to [44] for an excellent discussion of its properties. Typically its normalized form is employed for segmentation purposes and we obtain the normalized Laplacian as

L_s = D^{-1/2} L D^{-1/2} = I − D^{-1/2} W D^{-1/2},

obviously a symmetric matrix. Another normalized Laplacian of nonsymmetric form is given by

L_w = D^{-1} L = I − D^{-1} W.

For the purpose of this paper we focus on the symmetric normalized Laplacian L_s, but everything we derive here can equally be applied to the nonsymmetric version, where we would then have to resort to nonsymmetric Krylov methods such as GMRES [36]. It is well known in the areas of data science, data mining, and image processing that the smallest eigenvalues and their associated eigenvectors possess crucial information about the structure of the data and/or image [44,47,5]. For this we state a remarkable property of the graph Laplacian L: for a general vector u ∈ R^n, with n the dimension of L,

u^T L u = (1/2) Σ_{j,i=1}^n w(v_j, v_i) (u_j − u_i)²,

which, as was illustrated in [44], is equivalent to the objective function of the graph RatioCut problem.
Intuitively, assume the vector u to be equal to one constant on a part A of the graph and to a different constant on the remaining vertices Ā. In this case u^T L u only contains terms from the edges with one vertex in A and one in Ā. Thus a minimization of u^T L u results in a minimal cut with respect to the edge weights across A and Ā. Obviously, 0 is an eigenvalue of L and its normalized variants, as L1 = D1 − W1 = 0 by the definitions of D and W, with 1 being the vector of all ones. Additionally, spectral clustering techniques heavily rely on the computation of the k smallest eigenvectors [44], and recently semi-supervised learning based on PDEs on graphs, introduced by Bertozzi and Flenner [5], utilizes a small number of such eigenvectors for a complexity reduction. It is therefore imperative to obtain efficient techniques to compute the eigenvalues and eigenvectors fast and accurately. Since we are interested in the k smallest eigenvalues of the matrix L_s = I − D^{-1/2} W D^{-1/2}, it is clear that we can instead compute the k largest positive eigenvalues of the matrix A := D^{-1/2} W D^{-1/2}. In case the graph G = (V, E) is sparse in the sense that every vertex is only connected to a small number of other vertices, and thus the matrix W is sparse, we can utilize the whole arsenal of numerical algorithms for the computation of a small number of eigenvalues, namely the Lanczos process [13], the Krylov-Schur method [41], or the Jacobi-Davidson algorithm [39]. In particular, the ARPACK library [21], available in Matlab via the eigs function, is a recommended choice. So, frankly speaking, in the case of a sparse and symmetric matrix W the eigenvalue computation is fast and the algorithms are very mature. Hence, we focus on the case of fully connected graphs, meaning that the matrix W is considered dense. The standard scenario for this case is that each node v_j ∈ V corresponds to a data vector v_j ∈ R^d and the weight matrix is constructed as

(2.2) w(v_j, v_i) = exp(−‖v_j − v_i‖² / σ²), j ≠ i,

with a scaling parameter σ.
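These properties are easy to verify numerically. The following sketch (our illustration, not code from the paper; all variable names are ours) assembles L = D − W for a tiny Gaussian-weighted graph and checks both L1 = 0 and the quadratic-form identity stated above:

```python
import numpy as np

# Tiny fully connected example graph: 4 nodes, Gaussian weights on 1D data.
rng = np.random.default_rng(0)
v = rng.standard_normal(4)            # one data value per node
sigma = 1.0
W = np.exp(-(v[:, None] - v[None, :])**2 / sigma**2)
np.fill_diagonal(W, 0.0)              # no loops: zero diagonal
D = np.diag(W.sum(axis=1))
L = D - W                             # combinatorial graph Laplacian

# L1 = 0: the all-ones vector is an eigenvector for eigenvalue 0.
one = np.ones(4)
assert np.allclose(L @ one, 0.0)

# Quadratic form identity: u^T L u = 1/2 * sum_{j,i} w_ji (u_j - u_i)^2.
u = rng.standard_normal(4)
quad = u @ L @ u
cut = 0.5 * np.sum(W * (u[:, None] - u[None, :])**2)
assert np.isclose(quad, cut)
```

Since all weights are non-negative, the quadratic form shows directly that L is positive semidefinite.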
For example, approaches with this kind of graph Laplacian have become increasingly popular in image processing [34], where the data vectors v_j encode color information of image pixels via their color channels. The data point dimension may then be d = 1 for grayscale images and d = 3 for RGB images. Other applications may involve simple Cartesian coordinates for v_j. While equation (2.2) is derived from a Gaussian kernel function K(y) := exp(−‖y‖²/σ²), other applications might call for different kernel functions like the Laplacian RBF kernel K(y) := exp(−‖y‖/σ), the multiquadric kernel K(y) := (‖y‖² + c²)^{1/2}, or the inverse multiquadric kernel K(y) := (‖y‖² + c²)^{-1/2} for a parameter c > 0, cf. Section 6.3. This means the weight matrix may be of the more general form

(2.3) w(v_j, v_i) = K(v_j − v_i), j ≠ i.

Often certain techniques are used to sparsify the Laplacian or otherwise reduce its complexity in order to apply the methods named above. In particular, sparsification has been proposed for the construction of preconditioners [40] for iterative solvers, which still require the efficient implementation of the matrix-vector products. In image processing, this can be achieved by considering only patches or other reduced representations of the image [47]. However, this might drop crucial nonlocal information encoded in the full graph Laplacian [34,12], which is why we want to avoid it here and focus on fully connected graphs with dense Laplacians.
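To make the construction (2.2) concrete, here is a naive O(n²) reference implementation (our illustration; the helper name is ours) that assembles the dense Gaussian weight matrix and the normalized matrix A = D^{-1/2} W D^{-1/2} explicitly. This is exactly the dense setup the fast methods of the next section avoid:

```python
import numpy as np

def gaussian_weight_matrix(V, sigma):
    """Dense Gaussian weight matrix (2.2): W_ji = exp(-||v_j - v_i||^2 / sigma^2),
    with zero diagonal since we do not permit loops."""
    sq = np.sum((V[:, None, :] - V[None, :, :])**2, axis=-1)
    W = np.exp(-sq / sigma**2)
    np.fill_diagonal(W, 0.0)
    return W

rng = np.random.default_rng(1)
V = rng.standard_normal((200, 3))      # n = 200 data points in R^3
W = gaussian_weight_matrix(V, sigma=2.0)
d = W.sum(axis=1)                      # degrees, diagonal of D
A = W / np.sqrt(np.outer(d, d))        # A = D^{-1/2} W D^{-1/2}

# Largest eigenvalues of A correspond to smallest eigenvalues of L_s = I - A;
# the largest eigenvalue of A is exactly 1 (eigenvector D^{1/2} 1).
eigvals = np.linalg.eigvalsh(A)
```

For n in the tens of thousands, this dense construction needs O(n²) memory and time, which motivates the NFFT-based fast summation below.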
3. NFFT-based fast summation. For eigenvalue computation as well as various other applications with the graph Laplacian, one needs to perform matrix-vector multiplications with the matrix W or the matrix A := D^{-1/2} W D^{-1/2}. In general, this requires O(n²) arithmetic operations. When the matrix W has entries (2.2), this arithmetic complexity can be reduced to O(n) using the NFFT-based fast summation [31,32]. In general, this method may be applied when the entries of the matrix W can be written in the form

(3.1) W_ji = K(v_j − v_i), j ≠ i,

where K : R^d → C is a rotationally invariant and smooth kernel function. For applying the NFFT-based fast summation to (2.3), it is more convenient to consider the matrix with entries equal to K(0) on the diagonal, and we refer to this matrix as W̃. Note that it can be written as W̃ = W + K(0) I and thus W = W̃ − K(0) I. In order to efficiently compute the row sums of W, which appear on the diagonal of D, we use

D = diag(W1) = diag(W̃1 − K(0) 1).

We now illustrate how to efficiently compute the matrix-vector product with the matrix W̃ using the NFFT-based fast summation. For instance, for the Gaussian kernel function, given a coefficient vector x = [x_1, …, x_n]^T, we rewrite (3.1) as

(3.2) f(v_j) := (W̃x)_j = Σ_{i=1}^n x_i K(v_j − v_i), j = 1, …, n,

with the kernel function K(y) := exp(−‖y‖²/σ²). The key idea of the efficient computation of (3.2) is approximating K by a trigonometric polynomial K_RF in order to separate the computations involving the vertices v_j and v_i. Assuming we have such a d-variate trigonometric polynomial

(3.3) K_RF(y) := Σ_{l ∈ I_N} b̂_l e^{2πi l·y}, I_N := {−N/2, …, N/2 − 1}^d,

with bandwidth N ∈ 2N and Fourier coefficients b̂_l, we replace K by K_RF in (3.2) and obtain

f(v_j) ≈ f_RF(v_j) := Σ_{l ∈ I_N} b̂_l ( Σ_{i=1}^n x_i e^{−2πi l·v_i} ) e^{2πi l·v_j}, ∀j = 1, …, n.
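The separation of the sums over i and j is the heart of the fast summation and can be demonstrated without any NFFT machinery: for a trigonometric kernel K_RF, the direct O(n²) sum and the factorized evaluation agree exactly. A small 1D sketch (our illustration; here both transforms are evaluated naively in O(nN), which the NFFT would accelerate to O(n + N log N)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 50, 16
v = rng.uniform(-0.25, 0.25, n)        # 1D nodes in [-1/4, 1/4]
x = rng.standard_normal(n)
l = np.arange(-N // 2, N // 2)         # frequency index set I_N
b = rng.standard_normal(N)             # arbitrary Fourier coefficients b_l

# Direct O(n^2) evaluation of f(v_j) = sum_i x_i * K_RF(v_j - v_i).
K_RF = lambda y: np.real(np.sum(b * np.exp(2j * np.pi * l * y)))
f_direct = np.array([sum(x[i] * K_RF(v[j] - v[i]) for i in range(n))
                     for j in range(n)])

# Separated evaluation: the inner sums over i are independent of j.
inner = np.array([np.sum(x * np.exp(-2j * np.pi * li * v)) for li in l])
f_sep = np.real(np.array([np.sum(b * inner * np.exp(2j * np.pi * l * vj))
                          for vj in v]))

assert np.allclose(f_direct, f_sep)
```

In the NFFT-based fast summation, the inner sums are computed by one adjoint NFFT and the outer sums by one NFFT.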
Using the NFFT [18], one computes the inner and outer sums for all j = 1, …, n in O(m^d n + N^d log N) arithmetic operations in total, where m ∈ N is an internal window cut-off parameter which influences the accuracy of the NFFT. Please note that since K_RF and f_RF are 1-periodic functions but neither K nor f is, one needs to shift and scale the nodes v_j such that they are contained in a subset of the cube [−1/2, 1/2]^d. Depending on the Fourier coefficients b̂_l, l ∈ I_N, of the trigonometric polynomial K_RF, where the b̂_l still have to be determined, we may need to scale the nodes v_j to a slightly smaller cube. We emphasize that we are not restricted to the Gaussian weight function or a rotationally invariant weight function. In fact, any kernel function K that can be well approximated by a trigonometric polynomial K_RF may be used. Next, we describe an approach to obtain suitable Fourier coefficients b̂_l of K_RF based on sampling values of K. In particular, we want to obtain a good approximation of K using a small number of Fourier coefficients b̂_l. Therefore, we regularize K to obtain a 1-periodic smooth kernel function K_R, which is p − 1 times continuously differentiable (in the periodic setting), such that its Fourier coefficients decay fast. Then, we approximate the Fourier coefficients of K_R using the trapezoidal rule, and this yields the Fourier coefficients b̂_l of K_RF.
For a rotationally invariant kernel function K(y), which is sufficiently smooth except possibly at the "boundaries" of the cube [−1/2, 1/2]^d, e.g. K(y) = exp(−‖y‖²/σ²), we only need to regularize near ‖y‖ = 1/2. We use the ansatz

K_R(y) := K(y) for ‖y‖ ≤ 1/2 − ε_B,  T_B(‖y‖) for 1/2 − ε_B < ‖y‖ < 1/2,  T_B(1/2) otherwise,

where T_B is a suitably chosen univariate polynomial, e.g. computed by a two-point Taylor interpolation. The parameter 0 < ε_B ≪ 1/2 determines the size of the regularization region, cf. [32, Sec. 2]. For the treatment of a rotationally invariant kernel function which has a singularity at the origin, we also refer to [32, Sec. 2]. Now we approximate K_R by the d-variate trigonometric polynomial K_RF from (3.3), where we compute the Fourier coefficients

(3.4) b̂_l := (1/N^d) Σ_{j ∈ I_N} K_R(j/N) e^{−2πi l·j/N}, l ∈ I_N.

Assuming one evaluation of K_R takes O(1) arithmetic operations, the computations in (3.4) require O(N^d log N) arithmetic operations in total using the fast Fourier transform. If all vertices v_j and their corresponding data vectors v_j ∈ R^d, j = 1, …, n, fulfill the condition ‖v_j‖₂ ≤ 1/4 − ε_B/2, we obtain an approximation of (3.2) by f(v_j) ≈ f_RF(v_j). Otherwise, we compute a correction factor ρ := (1/4 − ε_B/2) / max_{j=1,…,n} ‖v_j‖₂, use transformed vertices ṽ_j := ρ v_j, and adjust the parameters of the kernel function appropriately. For instance, in case of the Gaussian kernel function, we replace the scaling parameter σ by σ̃ := σρ for the regularized kernel function K_R.
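The trapezoidal rule (3.4) amounts to a single FFT of equispaced kernel samples. The following 1D sketch (our illustration) computes the coefficients b̂_l for a Gaussian that decays so strongly toward ‖y‖ = 1/2 that the boundary regularization can be skipped (ε_B = 0, as in the experiments of Section 6), and checks the resulting trigonometric approximation at random points:

```python
import numpy as np

# 1D sketch of (3.4): Fourier coefficients of the (here: unregularized) Gaussian
# kernel by the trapezoidal rule, i.e. an FFT of equispaced samples.
N = 64
sigma = 0.1
grid = np.arange(-N // 2, N // 2) / N          # equispaced samples j/N in [-1/2, 1/2)
K = lambda y: np.exp(-y**2 / sigma**2)
samples = K(grid)

# b_l = (1/N) * sum_j K(j/N) exp(-2*pi*i*l*j/N): one FFT after reordering.
b = np.fft.fftshift(np.fft.fft(np.fft.ifftshift(samples))) / N
l = np.arange(-N // 2, N // 2)

# Check the trigonometric approximation K_RF(y) = sum_l b_l exp(2*pi*i*l*y)
# at random points in the admissible region [-1/4, 1/4].
y = np.random.default_rng(3).uniform(-0.25, 0.25, 100)
K_RF = (b[None, :] * np.exp(2j * np.pi * y[:, None] * l[None, :])).sum(axis=1)
err = np.max(np.abs(K(y) - np.real(K_RF)))
assert err < 1e-8
```

For kernels that do not decay at the boundary, the two-point Taylor regularization T_B would be applied to the samples before the FFT.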
The error of the approximation f(v_j) := (W̃x)_j ≈ f_RF(v_j) depends on the kernel function as well as on the choice of the regularization smoothness p, the size of the regularization region ε_B, the bandwidth N, and the window cut-off parameter m. For a fixed accuracy, we fix these parameters p, ε_B, N, and m. Hence, for small to medium dimensions d, we obtain a fast approximate algorithm for the matrix-vector multiplication W̃x of complexity O(n), cf. Algorithm 3.1. This algorithm is implemented as applications/fastsum and matlab/fastsum in C and Matlab within the NFFT3 software library 1, see also [18], and we use the default Kaiser-Bessel window function. In Figure 1, we list the relevant control parameters of Algorithm 3.1 and the regularization approach (3.4).
Algorithm 3.1 Fast approximate matrix-vector multiplication W̃x using NFFT-based fast summation.
Input: vertices v_j ∈ R^d, j = 1, …, n; coefficient vector x ∈ R^n; Fourier coefficients b̂_l, l ∈ I_N.
1. Apply the d-dimensional adjoint NFFT to x and obtain â_l := Σ_{i=1}^n x_i e^{−2πi l·v_i} for all l ∈ I_N.
2. Multiply componentwise by the Fourier coefficients, ĉ_l := b̂_l â_l for all l ∈ I_N.
3. Apply the d-dimensional NFFT to ĉ and obtain f_RF(v_j) = Σ_{l ∈ I_N} ĉ_l e^{2πi l·v_j} ≈ (W̃x)_j for all j = 1, …, n.
Output: f_RF(v_j), j = 1, …, n.
Complexity: O(n) for fixed accuracy.

Parameter | Description
N ∈ 2N | bandwidth (in each dimension) of the trigonometric polynomial, such that K_RF ≈ K
m ∈ N | window cut-off parameter of the NFFT (m = 8 gives approximately IEEE double precision for the default Kaiser-Bessel window)
p ∈ N | smoothness of the regularized kernel K_R, which is p − 1 times continuously differentiable
ε_B | size of the regularization region near the boundary, 0 < ε_B ≪ 1/2

Note that every part of Algorithm 3.1 is deterministic and linear in the input vector x, i.e., the algorithm constitutes a linear operator that can be written as W̃ + E with an error matrix E. For theoretical error estimates on E, we refer to [31,32,19]. The basic idea is to start with the estimate caused by the approximation of the kernel K by a trigonometric polynomial K_RF, and to additionally take the errors caused by the NFFT into account. In practice, one may estimate the kernel approximation error

(3.5) ‖K_ERR‖_∞ := max_y |K(y) − K_RF(y)|, K_ERR := K − K_RF,

based on sampling values of K and K_RF. For theoretical error estimates of the NFFT for various window functions, we refer to [18,27]. In practice, choosing the window cut-off parameter of the NFFT m = 8 yields approximately IEEE double precision for the default Kaiser-Bessel window, see e.g. [18, Sec. 5.2]. We again emphasize that Algorithm 3.1 is not restricted to the Gaussian kernel function. Any kernel function K that can be well approximated by a trigonometric polynomial may be used, and the corresponding Fourier coefficients b̂_l, l ∈ I_N, are an input parameter of Algorithm 3.1.
Moreover, for the Gaussian kernel function, one could also use the analytic Fourier coefficients b̂_l from [19] for small values of the scaling parameter σ instead of computing b̂_l by interpolation in (3.4). In this case, explicit error bounds for ‖K_ERR‖_∞ are available.
3.1. Error propagation for normalized matrices. As seen in Section 2, many applications involving the graph Laplacian require matrix-vector products with a matrix A that itself does not follow the form of (2.3), but results from the normalization of such a matrix W. This normalization can be written as

Ax = D^{-1/2} W D^{-1/2} x,

so besides matrix-vector products with W, this also includes the computation of the degree matrix D. The error occurring from this approximation will then propagate to the evaluation error of Ax.
Algorithm 3.2 summarizes the usage of Algorithm 3.1 for this case. Note that if multiple matrix-vector products are required, e.g. in an iterative scheme, steps 1-4 can be performed once in a setup phase. The following lemma gives an estimation of the error of Algorithm 3.2 depending on the relative error of Algorithm 3.1.
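The setup/apply split can be sketched as follows (our illustration; the fast product of Algorithm 3.1 is stood in for by a dense product with W̃, and the helper names are ours):

```python
import numpy as np

def normalized_matvec_factory(wtilde_matvec, n, K0):
    """Return a function x -> A x with A = D^{-1/2} W D^{-1/2}, given a fast
    (possibly approximate) product x -> W~ x, where W~ = W + K(0) I.
    The degree vector is computed once in a setup phase."""
    one = np.ones(n)
    d = wtilde_matvec(one) - K0 * one          # row sums of W, i.e. diag(D)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    def matvec(x):
        z = d_inv_sqrt * x
        return d_inv_sqrt * (wtilde_matvec(z) - K0 * z)
    return matvec

# Stand-in for Algorithm 3.1: a dense product with W~ (Gaussian kernel, K(0) = 1).
rng = np.random.default_rng(4)
V = rng.standard_normal((100, 3))
Wt = np.exp(-np.sum((V[:, None] - V[None, :])**2, -1) / 4.0)   # sigma^2 = 4
matvec = normalized_matvec_factory(lambda x: Wt @ x, 100, K0=1.0)

# Reference: explicit normalized matrix.
W = Wt - np.eye(100)
d = W.sum(1)
A = W / np.sqrt(np.outer(d, d))
x = rng.standard_normal(100)
assert np.allclose(matvec(x), A @ x)
```

Replacing the dense lambda by the NFFT-based fast summation yields the O(n) version without ever forming W̃.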
Algorithm 3.2 Fast approximate matrix-vector multiplication Ax using NFFT-based fast summation.
Input: scaling parameter and vertex set, v_j ∈ R^d, j = 1, …, n, specifying W; vector x ∈ R^n.
1. Choose a correction factor ρ such that ‖ρ v_j‖₂ ≤ 1/4 − ε_B/2 for all j = 1, …, n.
2. For the computation of matrix-vector products with the matrix W̃ = W + K(0) I by Algorithm 3.1, determine appropriate control parameters for the NFFT-based fast summation, see Figure 1, and obtain Fourier coefficients b̂_l, l ∈ I_N, e.g. by (3.4) or [19].
3. Compute the degree vector d := W̃1 − K(0) 1 using Algorithm 3.1 (scale the output of Algorithm 3.1 by ρ for the multiquadric kernel and by 1/ρ for the inverse multiquadric kernel).
4. Set D^{-1/2} := diag(d)^{-1/2}.
5. Compute y := D^{-1/2} (W̃ (D^{-1/2} x) − K(0) D^{-1/2} x) using Algorithm 3.1 (scale the output of Algorithm 3.1 by ρ for the multiquadric kernel and by 1/ρ for the inverse multiquadric kernel).
Output: y, the approximate result of Ax.
Complexity: O(n) for fixed accuracy.

Lemma 3.1. Let W ∈ R^{n×n} be a matrix with non-negative entries and at least one positive entry per row. Given an error matrix E ∈ R^{n×n}, we define W_E = W + E as well as

D = diag(W1), D_E = diag(W_E 1), A = D^{-1/2} W D^{-1/2}, A_E = D_E^{-1/2} W_E D_E^{-1/2}.

Let d_min > 0 denote the minimum diagonal entry of D and furthermore set ε := ‖E‖_∞ and η := d_min. Then, for ε < η, the error ‖A − A_E‖_∞ satisfies a bound that is linear in ε.

Proof. By the definition of D_E and the fact that x ↦ x^{-1/2} and its first derivative are monotonically decreasing, we can bound the entries of D_E^{-1/2} − D^{-1/2} in terms of ε and η; analogously, we bound the perturbation of the right normalization factor. Detaching the left factor from its parenthesized expression, combining it with the right factor, resolving the binomial expression, and simplifying concludes the proof of the desired inequality.
The requirement ε < η means that ‖E‖_∞ must be smaller than the smallest diagonal entry in D. This condition cannot be avoided since otherwise, negative entries in D_E could not be ruled out, leading to imaginary entries in D_E^{-1/2} and thus in A_E. On the other hand, if ε is well below η, Lemma 3.1 yields that the absolute error in A is linear in ε, which is the relative error of Algorithm 3.1.
Alternatively, by ignoring the error caused by the NFFT, we obtain error estimations of the form

(3.6) ‖E‖_∞ ≤ n ‖K_ERR‖_∞.

In other words, the perturbation grows linearly in the size of the dataset. If either ‖W‖_∞ or d_min grew less fast, then Lemma 3.1 would not be applicable for large n because ε would eventually supersede η. However, if we assume that increasing n means adding more similarly distributed data points to the dataset, the average entry in W does not change and thus all row sums of W also grow linearly in n, including d_min and the maximum row sum ‖W‖_∞. A mathematical quantification of this observation is beyond the scope of this article, but in practice, the values for η and ε can be approximated and monitored to give a-posteriori error bounds. One way to do this is by using (3.6) and approximating ‖K_ERR‖_∞ via (3.5), where the maximum can be discretized in a large number of randomly drawn sample points. The accuracy of this approximation can be validated by explicitly computing the exact maximum absolute row sum

(3.7) ‖E‖_∞ = max_{i=1,…,n} ‖(W̃_E − W̃)^T e_i‖₁,

where e_i denotes the i-th unit vector and matrix-vector products with W̃_E = W + K(0) I + E are evaluated using Algorithm 3.1. The effort of computing (3.7) is O(n²). Equivalently, the true value for ‖A − A_E‖_∞ can be computed in the same fashion.

4. Krylov subspace methods and NFFT. The main contribution of this paper is the usage of NFFT-based fast summation for accelerating Krylov subspace methods, which are the state-of-the-art schemes for the solution of linear equation systems, eigenvalue problems, and more [35]. In the case of large dense matrices, the computational bottleneck is the setup of and multiplication with the system matrix itself. We will here exemplarily illustrate this for the Lanczos algorithm [20], which is the standard method for computing a few dominating, i.e. largest, eigenvalues of a symmetric matrix A [30,13].
It is based on looking for an A-invariant subspace in the Krylov space

K_k(A, q_1) := span{q_1, Aq_1, A²q_1, …, A^{k−1}q_1}.

This is achieved by iteratively constructing an orthonormal basis q_1, …, q_k of this space in such a way that the matrix Q_k = [q_1, …, q_k] ∈ R^{n×k} yields a tridiagonalization of A, i.e.

Q_k^T A Q_k = T_k,

where T_k is symmetric tridiagonal with diagonal entries α_1, …, α_k and off-diagonal entries β_2, …, β_k.
Such a matrix Q_k as well as the entries of T_k can be computed by the iteration

β_{j+1} q_{j+1} = A q_j − α_j q_j − β_j q_{j−1}, α_j = q_j^T A q_j,

with q_0 := 0 and β_1 := 0. The remarkable fact that this iteration produces orthonormal vectors is a consequence of the symmetry of A. We now summarize the first k steps of the Lanczos process in the relation

A Q_k = Q_k T_k + β_{k+1} q_{k+1} e_k^T,

where e_j denotes the j-th standard basis vector of the appropriate dimension. The eigenvalues and eigenvectors of the small matrix T_k are called the Ritz values and Ritz vectors, respectively, and can be computed efficiently. From T_k w = λw we then obtain

A Q_k w = Q_k T_k w + β_{k+1} q_{k+1} e_k^T w = λ Q_k w + β_{k+1} w_k q_{k+1},

where w_k is the k-th component of the Ritz vector w. We finally see via

‖A (Q_k w) − λ (Q_k w)‖₂ = |β_{k+1}| |w_k|

that a small value |β_{k+1}| indicates that (λ, Q_k w) is a good approximation to an eigenpair of A and that the Krylov space is close to containing an A-invariant subspace. There are many more practical issues that make the implementation of the Lanczos process more efficient and robust. We do not discuss these points in detail but refer to [30,21] for the details. Additionally, we want to point out that the above procedure can also be used for the solution of linear systems of equations. Standard methods based on the Lanczos process are the conjugate gradient method [16] and the minimal residual method [29], which are tailored to the solution of linear systems of the form Ax = b. Note that such applications involving the graph Laplacian are commonly found in kernel-based methods [6]. In the nonsymmetric case, which comes up e.g. when considering L_w, we can employ the Arnoldi method [35], which relies on a similar iteration where T_k is replaced by an upper Hessenberg matrix.
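The iteration above only touches A through matrix-vector products, which is exactly what makes the NFFT acceleration possible. A bare-bones sketch of the process (our illustration; no reorthogonalization or breakdown handling, unlike mature implementations such as ARPACK):

```python
import numpy as np

def lanczos(matvec, q1, k):
    """k steps of the Lanczos iteration for a symmetric operator given only as
    a matvec callback, e.g. the NFFT-accelerated product of Algorithm 3.2.
    Returns Q_k and the tridiagonal T_k with A Q_k = Q_k T_k + beta_{k+1} q_{k+1} e_k^T."""
    n = q1.shape[0]
    Q = np.zeros((n, k))
    alpha, beta = np.zeros(k), np.zeros(k - 1)
    q_prev, q = np.zeros(n), q1 / np.linalg.norm(q1)
    b = 0.0
    for j in range(k):
        Q[:, j] = q
        w = matvec(q) - b * q_prev          # A q_j - beta_j q_{j-1}
        alpha[j] = q @ w
        w -= alpha[j] * q
        if j < k - 1:
            beta[j] = b = np.linalg.norm(w)
            q_prev, q = q, w / b
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return Q, T

# Ritz values of T_k approximate the extreme eigenvalues of A; a well-separated
# dominant eigenvalue converges after a handful of iterations.
rng = np.random.default_rng(5)
M = rng.standard_normal((300, 300))
A = (M + M.T) / 2
u = rng.standard_normal(300); u /= np.linalg.norm(u)
A += 100.0 * np.outer(u, u)              # plant a dominant eigenvalue
Q, T = lanczos(lambda x: A @ x, rng.standard_normal(300), k=30)
ritz = np.linalg.eigvalsh(T)
assert abs(ritz[-1] - np.linalg.eigvalsh(A)[-1]) < 1e-8
```

Swapping the dense lambda for an NFFT-based matvec yields the NFFT-based Lanczos method used in the experiments.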
One main contribution of this paper is the fact that by evaluating matrix-vector products via the NFFT-based Algorithms 3.1 or 3.2, Krylov subspace methods are still applicable for dense matrices that are too large to store, let alone apply, as long as they stem from the kernel structure of (2.3) or normalization of such a matrix. In our experiments, this method will be denoted as NFFT-based Lanczos method.
A detailed discussion of the effect of inexact matrix-vector products on Krylov-based approximations can be found in [38].
5.1. The traditional Nyström extension. The Nyström extension is currently the method of choice to compute eigenvalue approximations of kernel-based matrices that are too large to allow for direct eigenvalue computation. See e.g. [11] and [25] for its applications in different settings. Originally introduced to the matrix computations context in [45], further improvements have been suggested in [10] and [9], and its usage for classification problems has been proposed in [5]. It is based on dividing the data points into a sample set X of L nodes and its complement Y. After permutation, the adjacency matrix W can be split into blocks

W = [ W_XX  W_XY ; W_XY^T  W_YY ],

where the blocks W_XX ∈ R^{L×L} and W_YY ∈ R^{(n−L)×(n−L)} are the adjacency matrices of the canonical subgraphs with node sets X and Y, respectively, and the block W_XY ∈ R^{L×(n−L)} contains the similarities between all combinations of nodes from X and Y. The basic idea of the Nyström method is to compute only W_XX and W_XY explicitly, but not the remaining block W_YY. If L ≪ n, the approach significantly decreases the required number of data point comparisons. Assuming that W_XX is regular, the method approximates W by

W_E = [ W_XX ; W_XY^T ] W_XX^{-1} [ W_XX  W_XY ],

which constitutes a rank-L approximation due to the size and regularity of W_XX. This formula is used once in approximating the degree matrix D by D_E = diag(W_E 1) and once in approximating the eigenvalues of A via a rank-L eigenvalue decomposition of A_E = D_E^{-1/2} W_E D_E^{-1/2}. This can be computed without having to set up the full matrix, e.g. by the technique described in [10], made up mainly of two singular value decompositions of (L × L)-sized matrices, which is technically only applicable if W is positive definite. Alternatively, we have achieved better results by computing a QR factorization of the normalized block column D_E^{-1/2} [ W_XX ; W_XY^T ] and working with the resulting factors. The eigenvalue accuracy depends strongly on the quality of the approximation W_YY ≈ W_XY^T W_XX^{-1} W_XY. Since the sample set X is a randomly chosen subset of the indices from 1, …, n, its size L is the decisive method parameter and its choice is a nontrivial task. On the one hand, L needs to be small for the method to be efficient. On the other hand, a too small choice of L may cause extreme errors, especially because the approximation error in D_E propagates to the eigenvalue computation. In spite of the positivity of the diagonal of D, negative entries in D_E cannot be ruled out and are observed in practice. Hence imaginary entries may occur in D_E^{-1/2} and thus in A_E, making the results extremely unreliable. This behaviour follows the same structure as Lemma 3.1; however, we do not have a meaningful bound on ‖W_YY − W_XY^T W_XX^{-1} W_XY‖_∞ that would guarantee favorable error behaviour.
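The block structure of the traditional Nyström extension can be sketched as follows (our illustration; all helper names are ours, and a pseudo-inverse replaces W_XX^{-1} because Gaussian kernel blocks are typically ill-conditioned):

```python
import numpy as np

def nystrom(V, kernel, L, rng):
    """Traditional Nystrom rank-L approximation of a kernel matrix:
    sample L landmark points X, form only W_XX and W_XY, and approximate
    the remaining block W_YY by W_XY^T W_XX^{-1} W_XY."""
    n = V.shape[0]
    idx = rng.permutation(n)
    X, Y = idx[:L], idx[L:]
    W_XX = kernel(V[X], V[X])
    W_XY = kernel(V[X], V[Y])
    C = np.vstack([W_XX, W_XY.T])                 # first L columns (after permutation)
    # Pseudo-inverse for numerical stability on near-singular kernel blocks.
    W_E = C @ np.linalg.pinv(W_XX, rcond=1e-10) @ C.T
    return W_E, idx

kernel = lambda P, Q: np.exp(-np.sum((P[:, None] - Q[None, :])**2, -1) / 4.0)
rng = np.random.default_rng(6)
V = rng.standard_normal((400, 3))
W_E, idx = nystrom(V, kernel, L=100, rng=rng)
W = kernel(V[idx], V[idx])                        # exact (permuted) kernel matrix
rel_err = np.linalg.norm(W - W_E) / np.linalg.norm(W)
```

Note that this sketch works with the W̃ variant (diagonal K(0)); only the L×n block C is ever compared against the data, which is the source of the method's speed and of its error.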

5.2. An NFFT-based accelerated Nyström-Gaussian method.
Another important contribution of this paper is the development of an improved Nyström method, which utilizes the NFFT-based fast summation from Section 3. It is based on a slightly different algorithm that has recently been introduced as a Nyström method, cf. [24] and the references therein. The basic idea is rewriting the traditional Nyström approximation as

A ≈ (AQ)(Q^T A Q)^{-1}(AQ)^T,

where Q ∈ R^{n×L} is a matrix with orthonormal columns. If Q holds the first L columns of a permutation matrix, one obtains the traditional Nyström method from Section 5.1. Inspired by similar randomized linear algebra algorithms such as the randomized singular value decomposition, this choice of Q is replaced in [24] by Q = orth(AG), where G ∈ R^{n×L} is a Gaussian matrix with normally distributed random entries and orth denotes column-wise orthonormalization. Unfortunately, this setup requires 2L matrix-vector products with the full matrix A.
We now propose accelerating these matrix-vector products by computing AQ column-wise via the NFFT-based fast summation Algorithm 3.1 in order to avoid the full matrix setup or slow direct matrix-vector products. In addition, we propose replacing the inverse (Q^T A Q)^{-1} by a low-rank approximation based only on the M ∈ N largest eigenvalues of Q^T A Q. This way, a rank-M approximation of A is produced, where M may be the actual number of required eigenvalues or larger. The resulting method "Nyström-Gaussian-NFFT" is presented in Algorithm 5.1. Its arithmetic complexity is O(nL²). At first glance, this arithmetic complexity seems to be identical to the one of the traditional Nyström method from Section 5.1. However, as we observe in the numerical tests in Section 6.1, we may choose the parameter L distinctly smaller for Algorithm 5.1, i.e., L ∼ k, where k is the number of eigenvalues and eigenvectors. Then, the resulting arithmetic complexity is O(nk²).

Output of Algorithm 5.1: Λ_k ∈ R^{k×k}, the diagonal matrix of approximated largest eigenvalues of A, and V_k ∈ R^{n×k}, the corresponding orthonormal eigenvector matrix.
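A simplified sketch of the Nyström-Gaussian idea (our illustration: we return Rayleigh-Ritz pairs from the randomized subspace, a mild variant of Algorithm 5.1, and the matvec callback is where Algorithm 3.1 would plug in):

```python
import numpy as np

def nystrom_gaussian(matvec, n, L, M, rng):
    """Randomized Nystrom sketch: Q = orth(A G) for a Gaussian random matrix G,
    then keep the M largest eigenpairs of Q^T A Q. Uses 2L products with A,
    which the NFFT-based fast summation would accelerate."""
    G = rng.standard_normal((n, L))
    AG = np.column_stack([matvec(G[:, i]) for i in range(L)])   # L products
    Q, _ = np.linalg.qr(AG)                                     # orthonormalize
    AQ = np.column_stack([matvec(Q[:, i]) for i in range(L)])   # L more products
    S = (Q.T @ AQ + AQ.T @ Q) / 2                               # symmetrized Q^T A Q
    lam, U = np.linalg.eigh(S)
    return lam[-M:], Q @ U[:, -M:]          # M largest Ritz values and vectors

rng = np.random.default_rng(7)
B = rng.standard_normal((300, 40))
A = B @ B.T / 40                            # symmetric PSD test matrix, rank 40
lam, Vk = nystrom_gaussian(lambda x: A @ x, 300, L=60, M=10, rng=rng)
```

When L exceeds the numerical rank of A, the subspace spanned by Q captures the dominant eigenspace and the Ritz pairs are accurate; in the experiments below L ∼ k already suffices.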
Complexity: O(nL²).

6. Numerical results. All our experiments are performed using Matlab implementations based on the NFFT3 library and Matlab's eigs function. A short example code can be found on the homepage of the authors. 2

6.1. Accuracy and runtime of eigenvalue computations. We use the function generateSpiralDataWithLabels.m 3 to generate varying sets of three-dimensional data. The data points are in the form of a spiral and we can specify the number of classes as well as the number of points per class. We generate data sets with 5 classes and equal numbers of points per class, which have a total number of data points n ∈ {2 000, 5 000, 10 000, 20 000, 50 000, 100 000}. For the generation, we use the default parameters h = 10 and r = 2 in generateSpiralDataWithLabels.m. For each n, we generate 5 random spiral data sets. In Figure 2a, we visualize an example data set with n = 2 000 total points. For the adjacency matrix W, we set the scaling parameter σ = 3.5. Using the NFFT-based Lanczos method from Section 4, we compute the 10 largest eigenvalues and the corresponding eigenvectors of the matrix A := D^{-1/2} W D^{-1/2} for each data set. We consider three different parameter setups for the NFFT in Algorithm 3.1, achieving different accuracies. We set the bandwidth N = 16 and the window cut-off parameter m = 2 in setup #1, N = 32 and m = 4 in setup #2, as well as N = 64 and m = 7 in setup #3. For all three setups, we use ε_B = 0. For comparison, we also apply the Nyström method from Section 5.1, where we perform 10 repetitions for each data set, since the method uses random sub-sampling in order to obtain a rank-L approximation of the adjacency matrix W. We consider two different Nyström setups with rank L ∈ {n/10, n/4}.
Moreover, we use the hybrid Nyström-Gaussian-NFFT method from Algorithm 5.1 in Section 5.2 with L ∈ {20, 50} Gaussian columns, parameter M = 10, as well as fast summation parameters corresponding to setup #2, where we perform 10 repetitions for each data set. Additionally, we compute the eigenvalues and eigenvectors by a direct method, which applies the Lanczos method using full matrix-vector products with the adjacency matrix W. For the Nyström method from Section 5.1 and the direct computation method, we only run tests for a total number of data points n ∈ {2 000, 5 000, 10 000, 20 000} due to long runtimes. In Figure 3, we visualize the results of the test runs. We show the minimum, average and maximum of the maximum eigenvalue errors in Figure 3a. For this, we first determine the maximum eigenvalue errors

(6.1) max_{j=1,…,10} |λ_j − λ_j^(direct)|

for each test run, where λ_j denotes the j-th eigenvalue computed by the method under consideration and λ_j^(direct) the one computed by a direct method using full matrix-vector products with the matrix A. Then, for fixed total number of data points n and fixed parameter setup, we compute the minimum, average and maximum of (6.1), where the minimum, average and maximum are computed using 5 instances of (6.1) for the NFFT-based Lanczos method and 5 · 10 instances of (6.1) for the Nyström-based methods. We observe that the averages of the maximum eigenvalue errors (6.1) are above 10^{-2} for the two considered parameter choices of the Nyström method from Section 5.1, even when the rank L is chosen as a quarter of the matrix size n. Moreover, the minima and maxima of (6.1) differ distinctly from the averages. In particular, the accuracies may vary strongly across different Nyström runs on an identical data set. For the NFFT-based Lanczos method, each minimum, average and maximum of the maximum eigenvalue errors (6.1) only differs slightly from one another.
The maximum eigenvalue errors (6.1) are around 10^−4 to 10^−3 for parameter setup #1, around 10^−10 to 10^−9 for setup #2, and below 10^−14 for setup #3. For the hybrid Nyström-Gaussian-NFFT method, which internally uses 2L NFFT-based fast summations with parameter setup #2, the maximum eigenvalue errors (6.1) are around 10^−3 to 10^−2 for L = 20 and around 10^−5 to 10^−4 for L = 50. This means that the observed maximum eigenvalue errors (6.1) are distinctly smaller than those of the traditional Nyström method, and the errors for L = 50 are slightly smaller than those of the NFFT-based Lanczos method with parameter setup #1.
In Figure 3b, we depict the minimum, average and maximum of the maximum residual norms

(6.2) max_{j=1,...,10} ‖A v_j − λ_j v_j‖_2

for each total number of data points n. We compute these numbers by first determining the maximum residual norms (6.2) for each test run, where λ_j denotes the j-th computed eigenvalue of A and v_j the corresponding eigenvector. Then, for fixed n and fixed parameter setup, we compute the minimum, average and maximum of (6.2). We observe that the averages of the maximum residual norms (6.2) are above 10^−1 for the considered parameter choices of the Nyström method, even when the rank L is chosen as a quarter of the matrix size n. Moreover, the minima and maxima of the maximum residual norms (6.2) differ distinctly from the averages. In particular, the accuracies may vary strongly across different Nyström runs on an identical data set. For the NFFT-based Lanczos method, the minimum, average and maximum of (6.2) only differ slightly from one another. The maximum residual norms (6.2) are around 10^−4 to 10^−3 for parameter setup #1, around 10^−8 for setup #2, and around 10^−15 to 10^−13 for setup #3. For the hybrid Nyström-Gaussian-NFFT method, the maximum residual norms (6.2) are around 10^−2 for L = 20 and around 10^−4 to 10^−3 for L = 50. In the latter case, the errors are slightly larger than those of the NFFT-based Lanczos method with parameter setup #1 for n ∈ {2 000, 5 000, 10 000, 20 000} data points and slightly smaller for n ∈ {50 000, 100 000}.
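As a small illustration, the two error measures (6.1) and (6.2) can be computed as follows. These hypothetical helpers are not the authors' code; A may be any object supporting the matrix product, e.g. a dense array or a fast matrix-free operator.

```python
import numpy as np

def max_eigenvalue_error(lam, lam_direct):
    """Maximum eigenvalue error (6.1): max_j |lambda_j - lambda_j^(direct)|."""
    return np.max(np.abs(np.asarray(lam) - np.asarray(lam_direct)))

def max_residual_norm(A, lam, V):
    """Maximum residual norm (6.2): max_j ||A v_j - lambda_j v_j||_2,
    with the eigenvectors stored as the columns of V."""
    R = A @ V - V * np.asarray(lam)[None, :]   # residual per eigenpair
    return np.max(np.linalg.norm(R, axis=0))
```

For exact eigenpairs both quantities vanish up to rounding, which gives a direct correctness check on any approximate eigensolver.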
Additionally, in Figure 3c, we investigate the average and maximum of the maximum residual norms (6.2) for each fixed eigenvalue λ_j for n = 20 000 data points. For Nyström with L = n/10, we observe that the residual norms belonging to the first eigenvalue are distinctly larger than those for the remaining eigenvalues. In general, the observed maximum residual norms (6.2) vary similarly for each eigenvalue. For the NFFT-based Lanczos method with parameter setups #2 and #3, the maximum residual norms (6.2) of the tail eigenvalues are slightly smaller than those of the leading eigenvalues, which is not the case for parameter setup #1 or for the results of the hybrid Nyström-Gaussian-NFFT method.
In Figure 3d, we show the average and maximum runtimes of the different methods and parameter choices in dependence of the total number of data points n. The runtimes were determined on a computer with Intel Core i7 CPU 970 (3.20 GHz) using one thread. We remark that the NFFT supports OpenMP, cf. [43], but we restricted all time measurements to 1 thread for better comparison. We observe that the runtimes of the traditional Nyström method grow approximately like n^3, and the runtimes of the direct computation method for the eigenvalues grow approximately like n^2. Moreover, the slopes of the runtime graphs of the NFFT-based Lanczos method are distinctly smaller, and the runtimes grow approximately like n. Depending on the parameter choices, the NFFT-based Lanczos method is faster than the Nyström method once the total number of data points n exceeds 2 000 to 10 000. The hybrid Nyström-Gaussian-NFFT method with parameter L = 20 is slightly slower than the NFFT-based Lanczos method with setup #2. For the parameter L = 50, the method is slower by a factor of approximately 2.5. In both cases, the runtimes grow approximately like n. The runtimes of the direct method were the highest in most cases. For the tests, we precomputed the diagonal entries of the matrix D^{−1/2}, but we computed the entries of the weight matrix W anew for each matrix-vector multiplication with the matrix A. Alternatively, one could store the whole matrix A ∈ R^{n×n} for small problem sizes n, which would have reduced the runtimes of the direct method to about 1/20. However, we would then have to store at least n(n − 1)/2 values, which would already require about 10 GB of RAM for n = 50 000 in double precision.
For comparison, we also applied the FIGTree method from [26] to our test cases, and we denote the obtained results by "FIGTree-Lanczos" in Figure 3. The FIGTree accuracy parameter ε was chosen in {5·10^−3, 2·10^−6, 10^−10} such that the resulting residual norms (6.2) in Figure 3b approximately match those of the NFFT-based Lanczos method for setups #1, #2, #3. We observe that the obtained eigenvalue accuracies in Figure 3a for ε = 5·10^−3 and ε = 10^−10 are similar to those of the NFFT-based Lanczos method for setups #1 and #3, respectively. For n ≥ 5 000 data points and FIGTree accuracy parameter ε = 2·10^−6, we observe for our test case that the obtained eigenvalue accuracies are lower by about two orders of magnitude compared to the NFFT-based Lanczos method with setup #2. Looking at the obtained runtimes, we observe that "FIGTree-Lanczos" requires approximately 4 to 7 times the runtime of the corresponding NFFT-based Lanczos method with comparable eigenvector accuracy in most cases.

6.2. Applications.
In the following, we will showcase the effect of the improved accuracy on popular data science methods that utilize the graph Laplacian matrix. We will compare how the methods perform if the eigenvectors are computed with the NFFT-based Lanczos method or the traditional Nyström extension.
6.2.1. Spectral clustering. Spectral clustering is an increasingly popular technique [44], and we briefly illustrate the method proposed in [28]. The basis of their algorithm is a truncated eigenapproximation V_k D_k V_k^T with V_k ∈ R^{n×k}, which is an approximation based on the smallest eigenvalues and eigenvectors of the graph Laplacian. Now the rows of V_k are normalized to obtain a matrix Y_k. The normalized rows are then divided into a fixed number of disjoint clusters by a standard k-means algorithm.
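The normalize-then-cluster step can be sketched as follows; this is a hypothetical Python helper that uses SciPy's kmeans2 in place of Matlab's k-means and assumes V_k is the n × k matrix of computed eigenvectors.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(V_k, n_clusters, seed=0):
    """Cluster via the k leading eigenvectors V_k (n x k): normalize each
    row of V_k to unit length, then run k-means on the normalized rows Y_k."""
    norms = np.linalg.norm(V_k, axis=1, keepdims=True)
    Y_k = V_k / np.maximum(norms, 1e-12)        # guard against zero rows
    _, labels = kmeans2(Y_k, n_clusters, minit='++', seed=seed)
    return labels
```

Row normalization places every data point on the unit sphere, so that k-means separates points by the angular structure of the eigenvector embedding rather than by its magnitude.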
Here, we apply spectral clustering to an image segmentation problem. The original image of size 533 × 800 is depicted in Figure 5a. We construct a graph Laplacian where each pixel corresponds to a node in the graph and the distance measure is the distance between the values in all three color channels, such that each vertex v_j ∈ {0, 1, . . . , 255}^3. Correspondingly, the graph Laplacian would be a dense matrix of size 426 400 × 426 400. We set the scaling parameter σ = 90. Figure 4 shows the first ten eigenvalues of the matrix A.
For obtaining reference results, we use the Matlab function eigs on the full matrix A to compute 4 eigenvectors; this required more than 31 hours using up to 32 threads on a computer with Intel Xeon E7-4880 CPUs (2.50 GHz), consuming more than 500 CPU hours in total. Next, we applied the NFFT-based Lanczos method from Section 4 with parameters N = 16, m = 2, p = 2 and ε_B = 1/8 for the eigenvector computations. We show the results in Figures 5b and 5c for k = 2 and k = 4 classes, respectively. The segmented images look satisfactory. The main features of the image are preserved and large areas of similar color are correctly assigned to the same cluster, while there are only small "noisy" areas. Compared to the segmented image from the direct computations, we have approximately 0.1 % differences (467 out of 426 400) in the class assignments in the case of k = 4 classes. For the runtimes, we measure approximately 25 seconds for the NFFT-based Lanczos method and 18 seconds for the k-means algorithm on a computer with Intel Core i7 CPU 970 (3.20 GHz) using one thread.
Additionally, we ran the Nyström method 100 times with parameter L = 250. Here the runtimes were approximately 60 seconds on average, excluding the runtime for the clustering. We applied the k-means algorithm for k = 4 classes, which required approximately 22 seconds on average. We observed that in 79 of the 100 test runs of Nyström followed by k-means, the images appear to be very close to the ones obtained when applying eigs on the full matrix A, i.e., the differences are less than 2 %. In Figure 5d, we visualize the results of a corresponding test run. However, in 13 of the 100 test runs, the Nyström method returned eigenvectors which caused segmentation differences of more than 20 % with such "noisy" images that we consider these as "failed" runs. See Figure 5e for one example with approximately 25 % differences. The differences between Figures 5c and 5e are shown as a black and white picture in Figure 5f. Moreover, we tested increasing the parameter L to 500. Then, the runtimes increased to approximately 152 seconds on average. When applying the k-means algorithm to the obtained eigenvectors, the results improved. The differences compared to the reference image segmentation are less than 2 % in 85 of the 100 test runs and larger than 20 % in 9 test runs.

6.2.2. Semi-supervised learning by a phase field method. We here want to present an exemplary method that relies heavily on a number of eigenvectors of the graph Laplacian. It was proposed by Bertozzi and Flenner [5] and corresponds to a semi-supervised learning (SSL) problem. Suppose we have a graph-based data set as before where each vertex is assigned to one of C classes. A training set of s random sample vertices from each class is selected. For the case of C = 2 classes, a training vector f ∈ R^n is set up with entries −1 for training nodes from one class, 1 for training nodes from the other class, and 0 for nodes that do not belong to the training data.
The task of SSL is to use f to find a classification vector u ∈ R n . The sign of its entries is then used to predict each node's assigned class.
One successful approach computes u as the end point of the trajectory described by the graph Allen-Cahn equation

∂u/∂t = −ε L_s u − (1/ε) ψ'(u) − Ω(u − f)

(see [42, 23] for details). Here ψ(u) = (u^2 − 1)^2 is the double-well potential, which we understand to be applied component-wise, and Ω denotes a diagonal matrix with entries Ω_ii = ω_0 > 0 if vertex i belongs to the training data and Ω_ii = 0 otherwise. To discretize this ODE we will not introduce an index for the temporal discretization but rather assume that all values u are evaluated at the new time point, whereas ū indicates the previous time point. We then obtain

(1 + cτ) u + τ ε L_s u = (1 + cτ) ū − (τ/ε) ψ'(ū) − τ Ω(ū − f),

where u is a vector defined on the graph on which we base the final classification decision. Here, c > 0 is a positive parameter for the convexity splitting technique [5] and τ > 0 denotes the time step size. For a more detailed discussion of how to set these parameters we refer to [5, 7]. We now use the k computed eigenvalues and eigenvectors (λ_j, v_j) of L_s such that we can write u = Σ_{j=1}^k u_j v_j, and from this we get

(1 + cτ + τ ε λ_j) u_j = (1 + cτ) ū_j − (τ/ε) b_j − τ d_j,  j = 1, . . . , k,

where b_j = v_j^T ψ'(ū) and d_j = v_j^T Ω(ū − f) denote the coefficients of ψ'(ū) and Ω(ū − f) in the eigenvector basis. This equation can be solved to obtain the new coefficients u_j from the old coefficients ū_j. After a sufficient number of time steps, u converges to a stable solution.

Figure 5. Results of image segmentation (533 × 800 = 426 400 pixels) via spectral clustering and k-means using the NFFT-based Lanczos method from Section 4 and the Nyström method from Section 5.1; (e) k = 4 classes, Nyström ("failed" run); (f) differences between (c) and (e). "Failed run" in Subfigure (e) means segmentation differences of more than 20 % compared to the results obtained when applying eigs on the full matrix A.
We apply this method to the same spiral data set as seen in Section 6.1, again with σ = 3.5, but this time only with n = 100 000. The data points have been generated by a multivariate normal distribution around five center points, and the true label of each vertex has been set to the center point that is closest to it. We computed the eigenvectors to the k = 5 smallest eigenvalues of the Laplacian; once by the NFFT-based Lanczos method with N = 32, m = 4, and ε_B = 0, and once with the traditional Nyström method with L = 1 000, where only 5 columns of V_L are used. We then applied the described method with τ = 0.1, ε = 10, ω_0 = 10 000, and c = 2/ε + ω_0. The iteration terminated once the squared relative change in u was less than 10^−10. We repeat this process for 50 instances of the spiral data set and sample sizes s ∈ {1, 2, 3, 4, 5, 7, 10}. Figure 6 depicts the average accuracy results. We conclude that in this example, the increased eigenvector quality achieved by the NFFT-based method yields an average accuracy boost of approximately 0.5 to 1.5 percentage points, with the worst run also being significantly less bad. On a computer with Intel Core i7 CPU 4770 (3.40 GHz), the runtimes were approximately 8 seconds for the NFFT-based Lanczos method, 27 seconds for the Nyström method, and less than a second for the solution of the Allen-Cahn equation, which almost always converged after only three time steps.
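The time stepping in the eigenbasis can be sketched as follows. This hypothetical Python helper assumes (lam, V) are k computed eigenpairs of L_s, uses a convexity-splitting update of the kind described above, and takes c = 2/ε + ω_0 as an assumption on the splitting constant; it is a sketch, not the authors' implementation.

```python
import numpy as np

def allen_cahn_ssl(lam, V, f, omega0, tau=0.1, eps=10.0,
                   tol=1e-10, max_steps=2000):
    """Graph Allen-Cahn SSL in the span of k eigenpairs (lam, V) of L_s.
    f holds +/-1 on training nodes and 0 elsewhere; classify by sign(u)."""
    c = 2.0 / eps + omega0                 # assumed splitting constant
    omega = omega0 * (f != 0)              # diagonal of the fidelity matrix
    denom = 1.0 + c * tau + tau * eps * np.asarray(lam)   # implicit factors
    a = V.T @ f                            # start from the training vector
    for _ in range(max_steps):
        u = V @ a
        dpsi = 4.0 * u * (u**2 - 1.0)      # psi'(u) for psi(u) = (u^2 - 1)^2
        rhs = (1.0 + c * tau) * a \
              - (tau / eps) * (V.T @ dpsi) \
              - tau * (V.T @ (omega * (u - f)))
        a_new = rhs / denom
        converged = np.sum((a_new - a) ** 2) <= tol * np.sum(a ** 2)
        a = a_new
        if converged:                      # squared relative change criterion
            break
    return np.sign(V @ a)
```

On a toy graph of two weakly coupled cliques with one labeled node per clique, the scheme spreads the labels over the clusters and the sign of u recovers the partition.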
6.2.3. Semi-supervised learning by a kernel method. In addition to the phase field method, we employ a second semi-supervised learning technique used in [48, 14] for SSL problems with only two classes. Based on a training vector f holding 1, −1, or 0 just as in the previous section, a similar u is obtained by minimizing the function

(6.3) ‖u − f‖_2^2 + β u^T L_s u,

where β can be understood as a regularization parameter. For the solution of this minimization problem, we only have to solve the equation

(6.4) (I + β L_s) u = f,

where I is the identity matrix. Similar systems arise naturally in scattered data interpolation [17]. We run numerical tests using the crescentfullmoon.m data set with n = 100 000 data points and parameters r1=5, r2=5, r3=8. As illustrated in Figure 2b, the set is divided into two classes of points in the full moon and the crescent, distributed in a 1-to-3 ratio. We generate 5 random instances of the data set, and for each instance we run 10 repetitions with randomly chosen training data, where we consider s ∈ {1, 2, 5, 10, 25} known samples per class.
For the adjacency matrix W, we set the scaling parameter σ = 0.1. The tests are run with regularization parameter β ∈ {10^3, 3·10^3, 10^4, 3·10^4, 10^5}. We solve each system (6.4) using the CG algorithm with tolerance parameter 10^−4 and a maximum number of 1 000 iterations. For the fast matrix-vector multiplications with the matrix L_s, we use the NFFT-based fast summation in Algorithm 3.1 with parameters N = 512, m = 3, ε_B = 0. In Figure 7, we visualize the average and maximum misclassification rates of the 5 · 10 test runs for each fixed s and β. In the left plot, we show the misclassification rate in dependence of the number of samples s per class for the different regularization parameters β. We observe in general that the misclassification rates decrease for increasing s. The lowest rate is achieved for s = 25 samples per class and β = 10^4, where the average and maximum misclassification rates are 0.0012 and 0.0036, respectively. In the right plot, we depict the misclassification rate in dependence of the regularization parameter β for a fixed number of samples s per class. For s ∈ {1, 2, 5}, the average misclassification rates decline for increasing β until β = 3·10^4 and grow again for β = 10^5. For s ∈ {10, 25}, the average misclassification rates decline for increasing β until β = 10^4 and grow again afterwards. We remark that across all test runs, the maximum number of CG iterations was 536 and the maximum runtime for solving (6.4) was approximately 151 seconds on a computer with Intel Core i7 CPU 970 (3.20 GHz) using one thread.
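For illustration, the CG solve can be sketched as follows. This hypothetical Python code takes the system (6.4) to be (I + β L_s) u = f with L_s = I − D^{−1/2} W D^{−1/2}, which is an assumption on its exact form, and a dense Gaussian matvec stands in for the NFFT-based fast summation.

```python
import numpy as np

def cg(matvec, b, tol=1e-4, maxiter=1000):
    """Plain conjugate gradient for a symmetric positive definite operator."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):   # relative residual
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def ssl_solve(X, f, sigma, beta):
    """Solve (I + beta * L_s) u = f with L_s = I - D^{-1/2} W D^{-1/2};
    the dense Gaussian matvec below stands in for Algorithm 3.1."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / sigma**2)
    np.fill_diagonal(W, 0.0)
    d_isqrt = 1.0 / np.sqrt(W.sum(axis=1))
    matvec = lambda x: x + beta * (x - d_isqrt * (W @ (d_isqrt * x)))
    return cg(matvec, f.astype(float))
```

Since L_s is positive semidefinite, the system matrix I + β L_s is positive definite for any β > 0, so CG is applicable without further safeguards.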
Additionally, we used the NFFT-based Lanczos method from Section 4 in order to approximate the matrix A := D^{−1/2} W D^{−1/2} by a truncated eigenapproximation V_k D_k V_k^T with V_k ∈ R^{n×k}, which allows for computing the matrix-vector products in (6.4) in a fast way for fixed small k. Using k = 10 eigenvalues and eigenvectors, we achieve similar results to those shown in Figure 7. The computation of the eigenapproximation required up to 6 minutes on a computer with Intel Core i7 CPU 970 (3.20 GHz) using one thread. The maximum runtime for solving (6.4) was approximately 0.15 seconds.
Alternatively, we applied the Nyström method from Section 5.1 with parameter L = 5 000 to obtain a truncated eigenapproximation, where the corresponding computation required more than 3 hours for each eigenapproximation. However, the eigenvalues were not computed correctly in our tests. This was due to the matrix block W_XX in Equation (5.1) being ill-conditioned. Consequently, the CG method aborted in the first iteration and the output could not be used for classification.

In order to illustrate the flexibility of the NFFT-based fast summation, we also apply Algorithm 3.1 to a non-Gaussian weight function w in (2.2). Here, we consider the "Laplacian RBF kernel" K(y) := exp(−‖y‖/σ), such that the weight matrix is constructed as

(6.5) w(v_i, v_j) = exp(−‖v_i − v_j‖_2 / σ).

In our numerical tests, we set the shape parameter σ = 0.05, and we visualize the test results in Figure 8. We observe that the obtained misclassification rates are similar to those in Figure 7, where the Gaussian kernel was used. For some parameter settings, the misclassification rates are slightly better, for others slightly worse.

6.2.4. Kernel ridge regression.
In this section we show that our approach can be applied to the problem of kernel ridge regression, which has a similar flavour to the problem from the previous section. We here illustrate that our method is very flexible, since kernels other than the Gaussian can be used for the fast evaluation of matrix-vector products. The starting point is a simple linear regression problem via the minimization of

(6.6) arg min_{u ∈ R^d} ‖Xu − f‖_2^2 + β ‖u‖_2^2,

where X ∈ R^{n×d} is a design matrix holding the training feature vectors x_j ∈ R^d in its rows, i.e., X^T = [x_1, . . . , x_n], and f ∈ R^n is a given response vector. The solution u to this problem can then be used in a linear model to predict a response for any new point x ∈ R^d as F(x) = u^T x.
The well-known solution formula u = (X^T X + β I_d)^{−1} X^T f can be rearranged using the Sherman-Morrison-Woodbury formula to obtain

u = X^T (X X^T + β I_n)^{−1} f.

Using this formula, we can introduce the dual variable α = (X X^T + β I_n)^{−1} f and rewrite the predicted response of a new point x as

F(x) = u^T x = Σ_{j=1}^n α_j x_j^T x.

An idea for increasing the flexibility of this method is replacing expressions x_i^T x_j with K(x_i, x_j), where K : R^d × R^d → R is an arbitrary kernel function [33]. This leads to replacing X X^T with the Gram matrix K with entries K_ij = K(x_i, x_j) for all i, j = 1, . . . , n.
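A minimal sketch of kernel ridge regression built on these formulas: this hypothetical Python code uses a direct solve in place of the NFFT-based matrix-vector products combined with CG used in the paper, and the kernel definitions, including the inverse multiquadric 1/sqrt(‖x − y‖² + σ²), are illustrative assumptions.

```python
import numpy as np

def kernel_ridge(X_train, f, beta, kernel):
    """Fit the dual coefficients alpha = (K + beta I_n)^{-1} f for the Gram
    matrix K_ij = kernel(x_i, x_j); returns the prediction function F."""
    K = kernel(X_train, X_train)
    n = K.shape[0]
    alpha = np.linalg.solve(K + beta * np.eye(n), f)
    return lambda X_new: kernel(X_new, X_train) @ alpha   # F(x) = sum_j alpha_j K(x_j, x)

def gaussian(X, Y, sigma=1.0):
    """Gaussian kernel matrix exp(-||x - y||^2 / sigma^2)."""
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma**2)

def inv_multiquadric(X, Y, sigma=1.0):
    """Inverse multiquadric kernel matrix 1 / sqrt(||x - y||^2 + sigma^2)."""
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return 1.0 / np.sqrt(sq + sigma**2)
```

Swapping the kernel argument is all that is needed to move between the two kernels, mirroring the flexibility of the NFFT-based fast summation for different kernel functions.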
Consequently, the dual variable becomes α = (K + β I_n)^{−1} f and we obtain the kernel-based prediction function

F(x) = Σ_{j=1}^n α_j K(x_j, x).

For more details we refer to [33]. It is easily seen that the main effort of this algorithm goes into the computation of the coefficient vector α = (K + β I_n)^{−1} f. Note that this is where we again use the NFFT-based matrix-vector products in combination with the preconditioned CG method, as the matrix K + β I_n is positive definite and amenable to being treated using the NFFT for a variety of different kernel functions. In Figure 9 we illustrate the results when kernel ridge regression is used with two different kernels, namely the Gaussian and the inverse multiquadric kernel.

7. Conclusion. In this work, we have successfully applied the computational power of NFFT-based fast summation to core tools of data science. This was possible due to the nature of the fully connected graph Laplacian and the fact that many algorithms, most notably the Lanczos method for eigenvalue computation, only require matrix-vector products with the Laplacian matrix. By using Fourier coefficients to approximate the Gaussian kernel, we use Algorithm 3.1 to compute strong approximations of the matrix-vector product in O(n) complexity without storing or setting up the full matrix, as opposed to the full matrix's O(n^2) storage, setup, and application complexity.
For eigenvalue and eigenvector computations, we have discussed the current alternative method of choice in the Nyström extension and developed a hybrid method that allows the basic Nyström idea to benefit from NFFT-based fast matrix-vector products. In our numerical experiments, we found that the Nyström-Gaussian-NFFT method achieved much better eigenvalue accuracy than the traditional Nyström extension even for a significantly smaller parameter L, but was in turn outperformed by the NFFT-based Lanczos method.
In strongly eigenvector-dependent applications like the one in Section 6.2.2, the higher accuracy of the NFFT-based Lanczos method directly leads to better classification results. In some other applications, however, it is hard to predict whether better eigenvector accuracy distinctly improves the results. For instance, in Section 6.2.1, the traditional Nyström extension still achieved good image clusterings on average with small parameter L despite its rather inaccurate eigenvectors. Here, the NFFT-based Lanczos method still has very good selling points in its greatly improved runtime as well as its consistency, while the traditional Nyström extension tends to "fail" in some test runs.