A Spectral Series Approach to High-Dimensional Nonparametric Regression

A key question in modern statistics is how to make fast and reliable inferences for complex, high-dimensional data. While there has been much interest in sparse techniques, current methods do not generalize well to data with nonlinear structure. In this work, we present an orthogonal series estimator for predictors that are complex aggregate objects, such as natural images, galaxy spectra, trajectories, and movies. Our series approach ties together ideas from kernel machine learning and Fourier methods. We expand the unknown regression on the data in terms of the eigenfunctions of a kernel-based operator, and we take advantage of orthogonality of the basis with respect to the underlying data distribution, P, to speed up computations and the tuning of parameters. If the kernel is appropriately chosen, then the eigenfunctions adapt to the intrinsic geometry and dimension of the data. We provide theoretical guarantees for a radial kernel with varying bandwidth, and we relate smoothness of the regression function with respect to P to sparsity in the eigenbasis. Finally, using simulated and real-world data, we systematically compare the performance of the spectral series approach with classical kernel smoothing, k-nearest neighbors regression, kernel ridge regression, and state-of-the-art manifold and local regression methods.


Introduction
A challenging problem in modern statistics is how to handle complex, high-dimensional data. Sparsity has emerged as a major tool for making efficient inferences and predictions for multidimensional data. Generally speaking, sparsity refers to a situation where the data, despite their apparent high dimensionality, are highly redundant with a low intrinsic dimensionality. In our paper, we use the term "sparse structure" to refer to cases where the underlying distribution P places most of its mass on a subset X of R^d of small Lebesgue measure. This scenario includes, but is not limited to, Riemannian submanifolds of R^d, and high-density clusters separated by low-density regions. In applications of interest, observable data often have (complex) sparse structure due to the nature of the underlying physical systems. For example, in astronomy, raw galaxy spectra are of dimension equal to the number of wavelength measurements d, but inspection of a sample of such spectra will reveal clear, low-dimensional features and structure resulting from the shared physical system that generated these galaxies. While the real dimensionality of the data is much smaller than d, the challenge remains to exploit this when predicting, for example, the age, composition, and star formation history of a galaxy.
In its simplest form, low-dimensional structure is apparent in the original coordinate system. Indeed, in regression, much research on "large p, small n" problems concerns variable selection and the problem of recovering a "sparse" coefficient vector (i.e., a vector with mostly zeros) with respect to the given variables. Such approaches include, for example, lasso-type regularization [51], the Dantzig selector [9], and RODEO [28]. There are also various extensions that incorporate lower-order interactions and groupings of covariates [57,61,37] but, like lasso-type estimators, they are not directly applicable to the more intricate structures observed in, e.g., natural images, spectra, and hurricane tracks.
At the same time, there has been a growing interest in statistical methods that explicitly consider geometric structure in the data themselves. Most traditional dimension-reducing regression techniques, e.g., principal component regression (PCR; [26]), partial least squares (PLS; [55]), and sparse coding [35], are based on linear data transformations and enforce sparsity (with respect to the L^1 or L^2 norm) of the regression in a rotated space. More recently, several authors [7,2,10] have studied local polynomial regression methods on nonlinear manifolds. For example, Aswani et al. [2] propose a geometry-based regularization scheme that estimates the local covariance matrix at a point and then penalizes regression coefficients perpendicular to the estimated manifold direction. In the same spirit, Cheng and Wu [10] suggest first reducing the dimensionality to the estimated intrinsic dimension of the manifold, and then applying local linear regression to a tangent plane estimate. Local regression and manifold-based methods tend to perform well when there is a clear submanifold, but these approaches are not practical in higher dimensions or when the local dimension varies from point to point in the sample space. Hence, existing nonparametric models still suffer when estimating unknown functions (e.g., density and regression functions) on complex objects x ∈ X ⊂ R^d, where d is large.
Much statistical research has revolved around adapting classical methods, such as linear, kernel-weighted, and additive models, to high dimensions. On the other hand, statisticians have paid little attention to the potential of orthogonal series approaches. In low dimensions, orthogonal series is a powerful nonparametric technique for estimating densities and regression functions. Such methods are fast to implement with easily interpretable results, they have sharp optimality properties, and a wide variety of bases allows the data analyst to model multiscale structure and any challenging shape of the target function [16]. As a result, Fourier series approaches have dominated research in signal processing and mathematical physics. This success, however, has not translated to more powerful nonparametric tools in dimensions of the order of d ∼ 100 or 1000; in fact, extensions via tensor products (as well as more sophisticated adaptive grid or triangulation methods; see [31] and references within) quickly become impractical in dimensions d > 3.
In this work, we will build on ideas from harmonic analysis and spectral methods to construct nonparametric methods for estimating unknown functions in high-dimensional spaces with non-standard data objects (such as images, spectra, and distributions) that possess sparse nonlinear structure. We derive a Fourier-like basis {ψ_j(x)}_{j∈N} of L^2(X) that adapts to the intrinsic geometry of the underlying data distribution P, and which is orthonormal with respect to P rather than the Lebesgue measure of the ambient space. The empirical basis functions are then used to estimate functions on complex data x ∈ X, such as, for example, the regression function r(x) = E(Y|X = x) of a response variable Y on an object x. Because of the adaptiveness of the basis, there is no need for high-dimensional tensor products. Moreover, we take advantage of the orthogonality property of the basis for fast computation and model selection. We refer to our approach as spectral series as it is based on spectral methods (in particular, diffusion maps [13,11,29] and spectral connectivity analysis [30]) and Fourier series. Sections 2.1-2.3 describe the main idea of the series method in a regression setting.
Our work generalizes and ties together ideas in classical smoothing, kernel machine learning [44,45,14], support vector machines (SVMs; [49]), and manifold regularization [6] without the many restrictive assumptions (fixed kernel, exact manifold, infinite unlabeled data, and so on) seen in other works. There is a large literature on SVMs and kernel machine learning that uses similar approximation spaces as us, but it is unclear whether and how those procedures adapt to the structure of the data distribution. Generally, there is a discrepancy between theoretical work on SVMs, which assumes a fixed RKHS (e.g., a fixed kernel bandwidth), and applied SVM work, where the RKHS is chosen in a data-dependent way (by, e.g., decreasing the kernel bandwidth ε_n for larger sample sizes n). Indeed, issues concerning the choice of tuning parameters, and their relation to the data distribution P, are considered to be open problems in the mainstream RKHS literature. The manifold regularization work by Belkin et al. [6] addresses adaptivity to sparse structure but under restrictive assumptions, such as the existence of a well-defined submanifold and the presence of infinite unlabeled data.
Another key difference between our work and kernel machine learning is that we explicitly compute the eigenfunctions of a kernel-based operator and then use an orthogonal series approach to nonparametric curve estimation. Neither SVMs nor manifold regularizers exploit orthogonality relative to P. In our paper, we point out the advantages of an orthogonal series approach in terms of computational efficiency (such as fast cross-validation and tuning of parameters), visualization, and interpretation. SVMs can sometimes have a "black box feel," whereas the spectral series method allows the user to directly link the data-driven Fourier-like eigenfunctions to the function of interest and the sample space. Indeed, there is a dual interpretation of the computed eigenfunctions: (i) They define new coordinates of the data, which are useful for nonlinear dimensionality reduction, manifold learning, and data visualization. (ii) They form an orthogonal Hilbert basis for functions on the data and are a means to nonparametric curve estimation via the classical orthogonal series method, even when there is no clearly defined manifold structure. There is a large body of work in the machine learning literature addressing the first perspective; see, e.g., Laplacian maps [3], Hessian maps [15], diffusion maps [13], Euclidean Commute Time maps [41], and spectral clustering [46]. In this paper, we are mainly concerned with the second view, i.e., that of estimating unknown functions on complex data objects and understanding the statistical properties of such estimators.
Fig. 1, for example, shows a 2D visualization of the Isomap face data using the eigenvectors of a renormalized Gaussian kernel as coordinates (Eq. 4). Assume we want to estimate the pose Y of the faces. How does one solve a regression problem where the predictor x is an entire image? Traditional methods do not cope well with this task, while our spectral series approach (Eq. 1 with estimated eigenfunctions as a basis) can use complex aggregate objects x (e.g., images, spectra, trajectories, and text data) as predictors, without an explicit dimension reduction step. Note that the eigenvectors capture the pose Y and other continuous variations of an image x fairly well, and that the regression E(Y|x) appears to vary smoothly in sample space. We will return to the face pose estimation problem in Sec. 6.1. We will also discuss the theoretical properties of a spectral series estimator of the regression function f(x) = E(Y|x) in Sec. 5, including the connection between smoothness and efficient estimators. Our paper has the following aims: (i) Unifying. To generalize and connect ideas in kernel machine learning, manifold learning, spectral methods, and classical smoothing, without the many restrictive assumptions (fixed kernel, exact manifold, infinite unlabeled data, low dimension) seen in other works. (ii) Theoretical. To present new theoretical results in the limit of the kernel bandwidth ε_n → 0 that shed light on why RKHS/SVM methods often are so successful for complex data with sparse structure (Theorem 14 and Corollary 16), and to link smoothness of the regression with respect to P to the approximation error of spectral series (Theorem 10). (iii) Experimental. To systematically compare the statistical as well as the computational performance of spectral series and other methods using simulated and real-world data. Competing estimators include classical kernel smoothing, k-nearest neighbors (kNN) regression, regularization in RKHS, and recent state-of-the-art manifold and local
regression methods. We ask questions such as: Do the methods scale well with increasing dimension d and increasing sample size n? What is the estimated loss, and what is the computational time?
The paper is organized as follows. In Sec. 2, we describe the spectral series method, including details on how to estimate relevant quantities from empirical data and how to tune model parameters. Sec. 3 discusses the connection to related work in machine learning and statistics. In Sections 4 and 5, we discuss the choice of kernel, and provide theoretical guarantees on the spectral series method. Finally, in Sec. 6, we compare the performance of spectral series and other nonparametric estimators for a wide variety of data sets.

General Formulation
In low dimensions, orthogonal series has proved to be a powerful technique for nonparametric curve estimation [16]. In higher dimensions, there is the question of whether one can find an appropriate basis and actually construct a series estimator that performs well. The general set-up of an orthogonal series regression is otherwise simple: Let X_1, ..., X_n be an iid sample from a distribution P with compact support X ⊂ R^d. Suppose we have a real-valued response

Y_i = f(X_i) + ε_i,

where f is an unknown function, and ε_i denotes iid random noise with mean zero and variance σ². Our goal is to estimate the regression function in situations where d is large and the data have sparse (i.e., low-dimensional) structure. Let {ψ_j}_{j∈N} be an orthonormal basis of some appropriate Hilbert space H with inner product ⟨·,·⟩_H and norm ‖·‖_H. We consider estimators of the form

f̂(x) = Σ_{j=0}^{J} β̂_j ψ̂_j(x),    (1)

where J is a smoothing parameter, and ψ̂_j and β̂_j, in the general case, are data-based estimators of the basis functions ψ_j and the expansion coefficients β_j = ⟨f, ψ_j⟩_H.
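To make the set-up concrete, here is a minimal sketch of an estimator of the form in Eq. 1 for the classical one-dimensional case with a fixed, known cosine basis on [0, 1], so that no basis estimation is needed. All function names and the choice of target function are illustrative, not part of the method developed below.

```python
import numpy as np

def psi(x, j):
    """Orthonormal cosine basis on [0, 1]: psi_0 = 1, psi_j = sqrt(2) cos(pi j x)."""
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * x)

def series_regression(X, Y, J):
    """Orthogonal series estimator: beta_j = (1/n) sum_i Y_i psi_j(X_i)."""
    beta = [np.mean(Y * psi(X, j)) for j in range(J + 1)]
    return lambda x: sum(b * psi(x, j) for j, b in enumerate(beta))

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, 5000)
Y = np.cos(np.pi * X) + 0.1 * rng.normal(size=X.size)   # f(x) = cos(pi x)
f_hat = series_regression(X, Y, J=5)
```

Because the basis is orthonormal with respect to the uniform distribution on [0, 1], each coefficient is a simple empirical average; the high-dimensional difficulty addressed in this paper is precisely that no such fixed basis is available.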

What Basis?
A challenging problem is how to choose a good basis. The standard approach in nonparametric curve estimation is to choose a fixed known basis {ψ_j}_{j∈N} for, say, L^2([0, 1]), such as a Fourier or wavelet basis. There is then no need to estimate basis functions. In theory, such an approach can be extended to, e.g., L^2([0, 1]^d) by a tensor product,¹ but tensor-product bases, as well as more sophisticated adaptive grid or triangulation methods (see [31] and references within), quickly become unusable for even as few as d = 5 dimensions.

¹ Traditional orthogonal series estimators require d − 1 tensor products in R^d. For instance, if d = 2, then it is common to choose a basis of the form {ψ_{i,j}(x) = ψ_i(x_1)ψ_j(x_2) : i, j ∈ N}, where x = (x_1, x_2), and {ψ_i(x_1)}_i and {ψ_j(x_2)}_j are bases for functions in L^2(R).
What basis should one then choose when the dimension d is large, say, d ∼ 1000? Ideally, the basis should be able to adapt to the underlying structure of the data distribution. This means: The basis should be orthogonal with respect to the distribution P that generates the data, as opposed to the standard Lebesgue measure of the ambient space; the basis vectors should be concentrated around high-density regions where most of the "action" takes place; and the performance of the final series estimator should depend on the intrinsic rather than the ambient dimension of the data. In what follows, we present a spectral series approach where the unknown function is expanded into the estimated eigenfunctions of a kernel-based integral operator. As we shall see, the proposed estimator has many of the properties listed above.

Construction of Adaptive Basis
Our starting point is a symmetric and positive semi-definite (psd) so-called Mercer kernel k : X × X → R. These kernels include covariance functions and polynomial kernels, but we are in this work primarily interested in local, radially symmetric kernels k_ε(x, y) = g(‖x − y‖/√ε), where ε is a parameter that defines the scale of the analysis, and the elements k(x, y) are positive and bounded for all x, y ∈ X. To simplify the theory, we renormalize the kernel according to

a_ε(x, y) = k_ε(x, y) / p_ε(x),    (2)

where p_ε(x) = ∫_X k_ε(x, y) dP(y). This normalization is common in spectral clustering because it yields eigenvectors that act as indicator functions of connected components [53, Section 3.2]. The same normalization is also implicit in traditional Nadaraya-Watson kernel smoothers, which compute the local average at a point x by weighting surrounding points x_i by k_ε(x, x_i)/Σ_i k_ε(x, x_i). We will refer to a_ε(x, y) in Eq. 2 as the diffusion kernel. The term "diffusion" stems from a random walks view over the sample space [32,30]: One imagines a Markov chain on X with transition kernel Ω_ε(x, A) = P(x → A) = ∫_A a_ε(x, y) dP(y). Starting at x, this chain moves to points y close to x, giving preference to points with high density p(y). The chain essentially encodes the "connectivity" of the sample space relative to p, and it has a unique stationary distribution S_ε given by

S_ε(A) = ∫_A p_ε(x) dP(x) / ∫_X p_ε(x) dP(x),

where S_ε(A) → ∫_A p(x) dP(x) / ∫_X p(x) dP(x) as ε → 0. For finite ε, the stationary distribution S_ε is a smoothed version of P.
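As a quick numerical illustration of the normalization in Eq. 2, the sketch below builds the row-normalized Gaussian kernel on a sample, replacing the integral defining p_ε by an empirical average over the data points. The function name and parameter values are illustrative.

```python
import numpy as np

def row_normalized_kernel(X, eps):
    """Empirical diffusion kernel: a(i, j) = k_eps(X_i, X_j) / sum_l k_eps(X_i, X_l).

    Each row of the result is a transition distribution over the sample
    points, i.e., the discrete Markov chain described in the text."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (4.0 * eps))               # Gaussian kernel k_eps
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
A = row_normalized_kernel(X, eps=0.5)
```

The rows summing to one is exactly the row-stochastic property exploited throughout the paper.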
In our regression setting, we seek solutions from a Hilbert space associated with the kernel a_ε. Following [30], we define a "diffusion operator" A_ε, which maps a function f to a new function A_ε f, according to

A_ε f(x) = ∫_X a_ε(x, y) f(y) dP(y).    (3)

The operator A_ε has a discrete set of non-negative eigenvalues λ_{ε,0} = 1 ≥ λ_{ε,1} ≥ ... ≥ 0 with associated eigenfunctions ψ_{ε,0}, ψ_{ε,1}, ..., which we for convenience normalize to have unit norm. These eigenfunctions have two very useful properties: First, they are orthogonal with respect to the density-weighted L^2 inner product ⟨f, g⟩_ε = ∫_X f(x) g(x) dS_ε(x). Second, they also form a set of oscillatory functions which are concentrated around high-density regions. By construction, ψ_{ε,0} is a constant function, and the higher-order eigenfunctions are increasingly oscillatory. Generally speaking, ψ_{ε,j} is the smoothest function relative to P, subject to being orthogonal to ψ_{ε,i} for i < j.
Interpretation. The diffusion operator A_ε and its eigenfunctions contain information about the connectivity structure of the sample space. There are two ways one can view the eigenfunctions ψ_{ε,0}, ψ_{ε,1}, ψ_{ε,2}, ...: (i) The eigenfunctions define new coordinates of the data. If the data x represent high-dimensional complex objects, there is often no simple way of ordering the data. However, by a so-called "eigenmap" x → (ψ_{ε,1}(x), ψ_{ε,2}(x), ..., ψ_{ε,J}(x)), one can transform the data into an embedded space where points that are highly connected are mapped close to each other [29]. The eigenmap can be used for data visualization as in Fig. 1 and Fig. 6. If we choose J < d, then we are effectively reducing the dimensionality of the problem by mapping the data from R^d to R^J. (ii) The eigenfunctions form a Hilbert basis for functions on the data. More specifically, the set ψ_{ε,0}, ψ_{ε,1}, ... is an orthogonal basis of L^2(X, P). The value of this result is that we can express most physical quantities that vary as a function of the data as a series expansion of the form f(x) = Σ_{j=0}^{∞} β_{ε,j} ψ_{ε,j}(x). In this work, we study the second point of view and its implications for nonparametric estimation in high dimensions.

Estimating the Regression Function from Data
In practice, of course, we need to estimate the basis {ψ_{ε,j}}_j and the projections {β_{ε,j}}_j from data. In this section, we describe the details.
Given X_1, ..., X_n, we compute a row-stochastic matrix A_ε, where

A_ε(i, j) = k_ε(X_i, X_j) / Σ_{l=1}^n k_ε(X_i, X_l)

for i, j = 1, ..., n. The elements A_ε(i, j) can be interpreted as transition probabilities A_ε(i, j) = P(x_i → x_j) for a Markov chain over the data points (i.e., this is the discrete analogue of Eq. 2 and a diffusion over X). Let p̂_ε(x) = (1/n) Σ_{i=1}^n k_ε(x, X_i). The Markov chain has a unique stationary measure given by (ŝ_ε(X_1), ..., ŝ_ε(X_n)), where

ŝ_ε(X_i) = p̂_ε(X_i) / Σ_{j=1}^n p̂_ε(X_j);

that is, the ith element is, up to normalization, a kernel-smoothed density estimate at the ith observation.
To estimate the eigenfunctions ψ_{ε,1}, ..., ψ_{ε,J} of the continuous diffusion operator A_ε in Eq. 3, we first calculate the eigenvalues λ^A_{ε,1}, ..., λ^A_{ε,J} and the associated (orthogonal) eigenvectors ψ̃_{ε,1}, ..., ψ̃_{ε,J} of the symmetrized kernel matrix Ã_ε, where

Ã_ε(i, j) = k_ε(X_i, X_j) / ( √(Σ_l k_ε(X_i, X_l)) √(Σ_l k_ε(X_j, X_l)) ).

By construction, it holds that the λ^A_{ε,j}'s and the rescaled vectors ψ^A_{ε,j} are eigenvalues and right eigenvectors of the Markov matrix A_ε:

A_ε ψ^A_{ε,j} = λ^A_{ε,j} ψ^A_{ε,j},  where ψ^A_{ε,j}(i) ∝ ψ̃_{ε,j}(i) / √(Σ_l k_ε(X_i, X_l)),

and we normalize these eigenvectors so that Σ_{i=1}^n ψ^A_{ε,j}(i)² ŝ_ε(X_i) = 1 for j = 1, ..., J. Note that the n-dimensional vector ψ^A_{ε,j} can be regarded as estimates of ψ_{ε,j}(x) at the observed values X_1, ..., X_n. In other words, let

ψ̂_{ε,j}(X_i) = ψ^A_{ε,j}(i)  for i = 1, ..., n.

We estimate the function ψ_{ε,j}(x) at values of x not corresponding to one of the X_i's using the so-called Nyström method. The idea is to first rearrange the eigenfunction-eigenvalue equation λ_{ε,j} ψ_{ε,j} = A_ε ψ_{ε,j} as

ψ_{ε,j}(x) = (1/λ_{ε,j}) ∫_X a_ε(x, y) ψ_{ε,j}(y) dP(y),

and use the kernel-smoothed estimate

ψ̂_{ε,j}(x) = (1/λ^A_{ε,j}) Σ_{i=1}^n ( k_ε(x, X_i) / Σ_l k_ε(x, X_l) ) ψ̂_{ε,j}(X_i)    (11)

for λ^A_{ε,j} > 0.
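The passage from the symmetrized matrix to right eigenvectors of the row-stochastic matrix can be sketched as follows. This is a simplified illustration; here the eigenvectors are rescaled to have unit empirical norm, and all names are ours, not the paper's.

```python
import numpy as np

def diffusion_eigs(K, J):
    """Top-J eigenpairs of the row-stochastic matrix A = D^{-1} K,
    computed via the symmetric conjugate D^{-1/2} K D^{-1/2}, which
    shares its eigenvalues with A."""
    d = K.sum(axis=1)
    A_sym = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(A_sym)             # ascending order
    order = np.argsort(vals)[::-1][:J]
    lam = vals[order]
    psi = vecs[:, order] / np.sqrt(d)[:, None]     # right eigenvectors of A
    psi *= np.sqrt(len(d)) / np.linalg.norm(psi, axis=0)  # unit empirical norm
    return lam, psi

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 2))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)
lam, psi = diffusion_eigs(K, J=4)
```

Working with the symmetric conjugate lets one use a stable symmetric eigensolver; the division by √d then recovers the right eigenvectors of the Markov matrix.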
Our final regression estimator is defined by Eq. 1 with the estimated eigenvectors in Eq. 11 and expansion coefficients computed according to

β̂_{ε,j} = Σ_{i=1}^n ŝ_ε(X_i) Y_i ψ̂_{ε,j}(X_i).    (12)

Remark 1 (Semi-Supervised Learning, SSL). The spectral series framework naturally extends to semi-supervised learning (SSL) [60], where in addition to the "labeled" sample (X_1, Y_1), ..., (X_n, Y_n) there are additional "unlabeled" data; i.e., data X_{n+1}, ..., X_{n+m} where the covariates X_i but not the labels Y_i are known. Typically m ≫ n, as collecting data often is less costly than labeling them. By including unlabeled examples (drawn from the same distribution P_X) in the kernel matrix A_ε, we can improve our estimates of λ_{ε,j}, ψ_{ε,j}, and S_ε. The summation in Equations 9 and 11 will then be over all n + m observations, while Eq. 12 remains the same as before. See, e.g., [34,58] for SSL with Laplacian eigenmaps in the limit of infinite unlabeled data, i.e., in the limit m → ∞.
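A compact end-to-end sketch of such an estimator, evaluated at the training points. For simplicity, this version fits the coefficients by least squares in the estimated eigenvector coordinates rather than by the weighted projection of Eq. 12; the function name, bandwidth, and target function are all illustrative.

```python
import numpy as np

def fit_spectral_series(X, Y, eps, J):
    """Fitted values of a (simplified) spectral series regression:
    estimate the eigenbasis from the symmetrized Gaussian kernel, then
    regress Y on the first J+1 estimated eigenvectors."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (4.0 * eps))
    d = K.sum(axis=1)
    vals, vecs = np.linalg.eigh(K / np.sqrt(np.outer(d, d)))
    psi = vecs[:, np.argsort(vals)[::-1][:J + 1]] / np.sqrt(d)[:, None]
    beta, *_ = np.linalg.lstsq(psi, Y, rcond=None)   # coefficients
    return psi @ beta                                # fitted values at X_1..X_n

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
truth = np.sin(2.0 * X[:, 0])
Y = truth + 0.1 * rng.normal(size=200)
fitted = fit_spectral_series(X, Y, eps=0.05, J=10)
```

Even with only J + 1 = 11 basis functions, the smooth target is captured well, illustrating the sparsity of smooth functions in the adaptive eigenbasis.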

Loss Function and Tuning of Parameters
We measure the performance of an estimator f̂(x) via the L^2 loss function

L(f̂, f) = ∫_X (f̂(x) − f(x))² dP(x).    (13)

To choose tuning parameters (such as the kernel bandwidth ε and the number of basis functions J), we split the data into a training and a validation set. For each choice of ε and a sufficiently large constant J_max, we use the training set and Eqs. 11-12 to estimate the eigenvectors ψ̂_{ε,1}, ..., ψ̂_{ε,J_max} and the expansion coefficients β̂_{ε,0}, ..., β̂_{ε,J_max}. We then use the validation set (X'_1, Y'_1), ..., (X'_{n'}, Y'_{n'}) to compute the estimated loss

L̂(f̂) = (1/n') Σ_{i=1}^{n'} (Y'_i − f̂(X'_i))²,    (14)

which, up to a constant that does not depend on f̂, is an unbiased estimate of the L^2 loss, for different values of J ≤ J_max. We choose the (ε, J)-model with the lowest estimated loss on the validation set.
The computation for fixed ε and different J is very fast. Due to orthogonality of the basis, the estimates β̂_{ε,j} and ψ̂_{ε,j} depend on ε but not on J.
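This fast tuning loop can be sketched as follows, here with a fixed cosine basis for transparency: the coefficients are computed once at J_max, and the fits for all smaller J are obtained from prefix sums over the validation set. Names and the target function are illustrative.

```python
import numpy as np

def psi_mat(x, J):
    """Columns psi_0, ..., psi_J of the cosine basis on [0, 1]."""
    cols = [np.ones_like(x)]
    cols += [np.sqrt(2.0) * np.cos(np.pi * j * x) for j in range(1, J + 1)]
    return np.stack(cols, axis=1)

def validation_losses(Psi_tr, Y_tr, Psi_va, Y_va):
    """Validation loss for every truncation point J in one pass:
    coefficients are computed once; truncation is just a prefix sum."""
    beta = Psi_tr.T @ Y_tr / len(Y_tr)            # independent of J
    preds = np.cumsum(Psi_va * beta, axis=1)      # column j: fit with j+1 terms
    return np.mean((preds - Y_va[:, None]) ** 2, axis=0)

rng = np.random.default_rng(4)
f = lambda x: np.cos(np.pi * x) + 0.5 * np.cos(3 * np.pi * x)
X_tr, X_va = rng.uniform(size=1000), rng.uniform(size=1000)
Y_tr = f(X_tr) + 0.2 * rng.normal(size=1000)
Y_va = f(X_va) + 0.2 * rng.normal(size=1000)
losses = validation_losses(psi_mat(X_tr, 15), Y_tr, psi_mat(X_va, 15), Y_va)
J_best = int(np.argmin(losses))
```

Scanning all J ≤ J_max costs a single matrix pass, in contrast to methods that must be refit for each value of the smoothing parameter.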

Scalability
The spectral series estimator is faster than most traditional approaches in high dimensions.Once the kernel matrix has been constructed, the eigendecomposition takes the same amount of time for all values of d.
In terms of scalability for large data sets, one can dramatically reduce the computational cost by implementing fast approximate eigendecompositions. For example, the randomized SVD by Halko et al. [22] cuts down the cost from O(n³) to roughly O(n²) with little impact on statistical performance (see Fig. 9). According to Halko et al., these randomized methods are especially well-suited for parallel implementation, which is a topic we will explore in future work.
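A minimal sketch of the random range-finder idea behind such methods (an illustration in the spirit of Halko et al., not their exact algorithm; all names are ours):

```python
import numpy as np

def randomized_eigs(A, J, oversample=10, seed=0):
    """Approximate top-J eigenpairs of a symmetric psd matrix A by
    projecting onto a random sketch of its range, then solving a
    small (J + oversample)-dimensional eigenproblem."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(A @ rng.normal(size=(A.shape[0], J + oversample)))
    vals, vecs = np.linalg.eigh(Q.T @ A @ Q)      # small eigenproblem
    order = np.argsort(vals)[::-1][:J]
    return vals[order], Q @ vecs[:, order]

# Exactly rank-5 test matrix, so the sketch captures the range exactly.
rng = np.random.default_rng(5)
U, _ = np.linalg.qr(rng.normal(size=(300, 5)))
A = U @ np.diag([10.0, 8.0, 6.0, 4.0, 2.0]) @ U.T
lam, V = randomized_eigs(A, J=5)
```

The dominant cost is the matrix-sketch product, which is why the overall cost drops to roughly O(n²) when only a few leading eigenvectors are needed.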

Linear Regression with Transformed Data
One can view our series model as a (weighted) linear regression after a data transformation Z = Ψ(X), where Ψ = (ψ_1, ..., ψ_J) are the first J eigenvectors of the diffusion operator A_ε. By increasing J, the dimension of the feature space, we achieve more flexible, fully nonparametric representations. Decreasing J adds more structure to the regression, as dictated by the eigenstructure of the data.
Eq. 12 is similar to a weighted least squares (WLS) solution to a linear regression in (Z_1, Y_1), ..., (Z_n, Y_n), but with an efficient orthogonal series implementation and no issues with collinear variables. Define the n × (J + 1) matrix of predictors

Z = [Ψ̂_0 Ψ̂_1 ... Ψ̂_J],  where Ψ̂_j = (ψ̂_{ε,j}(X_1), ..., ψ̂_{ε,j}(X_n))^T,

and introduce the weight matrix

W = n · diag(ŝ_ε(X_1), ..., ŝ_ε(X_n)),

where Ψ̂_j and ŝ_ε are estimated from data (Equations 6 and 10). Suppose that Y = Zβ + e, where Y = (Y_1, ..., Y_n)^T, and the random vector e = (ε_1, ..., ε_n)^T represents the errors. By minimizing the weighted residual sum of squares

(Y − Zβ)^T W (Y − Zβ),

we arrive at the WLS estimator

β̂ = (Z^T W Z)^{-1} Z^T W Y = (1/n) Z^T W Y,

where the matrix W puts more weight on observations in high-density regions. This expression is equivalent to Eq. 12.
Note that, thanks to the orthogonality property Z^T W Z = nI, model search and model selection are feasible even for complex models with very large J. This is in clear contrast with standard multiple regression, where one needs to recompute the β̂_j estimates for each model with a different J, invert the matrix Z^T W Z, and potentially deal with inputs (columns of the design matrix Z) that are linearly dependent.
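The identity Z^T W Z = nI can be checked numerically. Below is a sketch, assuming eigenvectors computed from the symmetric conjugate of the Gaussian kernel matrix and weights proportional to the kernel-smoothed density at the sample points, with the scaling chosen so the identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)
d = K.sum(axis=1)                                   # ~ kernel-smoothed density
vals, vecs = np.linalg.eigh(K / np.sqrt(np.outer(d, d)))
order = np.argsort(vals)[::-1][:6]
# Right eigenvectors of D^{-1} K, scaled so that Z^T W Z = n I holds.
Z = np.sqrt(len(d)) * vecs[:, order] / np.sqrt(d)[:, None]
W = np.diag(d)                                      # density-based weights
```

Because the symmetric-conjugate eigenvectors are Euclidean-orthonormal, the density weighting cancels the D^{-1/2} rescaling exactly, which is what makes model selection over J essentially free.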
Remark 2 (Heteroscedasticity). More generally, let σ(x) be a non-negative function rather than a constant, and let ε_i be iid realizations of a random variable with zero mean and unit variance. Consider the regression model

Y_i = f(X_i) + σ(X_i) ε_i.

We can handle heteroscedastic errors by applying the same framework as above to a rescaled regression function g(x) = f(x)/σ(x).

Kernel Machine Learning and Regularization in RKHS
Kernel-based regularization methods use similar approximation spaces as us. In kernel machine learning [45,14], one often considers the variational problem

min_{f∈H_k} { (1/n) Σ_{i=1}^n L(y_i, f(x_i)) + γ ‖f‖²_{H_k} },    (19)

where L(y_i, f(x_i)) is a convex loss function, γ > 0 is a penalty parameter, and H_k is the Reproducing Kernel Hilbert Space (RKHS) associated with a symmetric positive semi-definite kernel k.³ Penalizing the RKHS norm ‖·‖_{H_k} imposes smoothness conditions on possible solutions. Now suppose that k has the eigen-decomposition k(x, y) = Σ_j λ_j ψ_j(x) ψ_j(y), where the RKHS inner product is related to the L^2 inner product according to ⟨ψ_i, ψ_j⟩_{H_k} = δ_{i,j}/λ_j. Eq. 19 is then equivalent to considering eigen-expansions f(x) = Σ_j β_j ψ_j(x) and seeking solutions to min_{f∈B_r} (1/n) Σ_{i=1}^n L(y_i, f(x_i)), where B_r is a ball of the RKHS H_k with radius r, and the RKHS norm is given by ‖f‖²_{H_k} = Σ_j β_j²/λ_j.

³ To every continuous, symmetric, and positive semi-definite kernel k : X × X → R is associated a unique RKHS H_k [1]. This RKHS is defined to be the closure of the linear span of the set of functions {k(x, ·) : x ∈ X }, with the inner product satisfying the reproducing property ⟨k(x, ·), f⟩_{H_k} = f(x).

Here are some key observations: (i) The above setting is similar to ours. The regularization in Eq. 19 differentially shrinks contributions from higher-order terms with small λ_j values. In spectral series, we use a projection (i.e., a basis subset selection) method, but the empirical performance is usually similar.
(ii) There are some algorithmic differences, as well as differences in how the two regression estimators are analyzed and interpreted. In our theoretical work, we consider Gaussian kernels with flexible variances; that is, we choose the approximation spaces in a data-dependent way (cf. multi-kernel regularization schemes for SVMs [56]) so that the estimator can adapt to sparse structure and the intrinsic dimension of the data. Most theoretical work in kernel machine learning assumes a fixed RKHS.
(iii) There are also other differences. Support Vector Machines [49] and other kernel-based regularization methods (such as splines, ridge regression, and radial basis functions) never explicitly compute the eigenvectors of the kernel. Instead, these methods rely on the classical Representer Theorem [54], which states that the solution to Eq. 19 is a finite expansion of the form f(x) = Σ_{i=1}^n α_i k(x_i, x). The original infinite-dimensional variational problem is then reduced to a finite-dimensional optimization of the coefficients α_i. In a naive least-squares implementation, however, one has to recompute these coefficients for each choice of the penalty parameter γ, which can make cross-validation cumbersome. In our spectral series approach, we take advantage of the orthogonality of the basis for fast model selection and computation of the β̂_j parameters. As in spectral clustering, we also use eigenvectors to organize and visualize data that can otherwise be hard to interpret.
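For contrast, here is a sketch of the representer-theorem route for squared-error loss (kernel ridge regression). Note that the coefficient vector must be re-solved from scratch for every new penalty γ, whereas truncated series fits reuse their coefficients. Names, bandwidth, and target function are illustrative.

```python
import numpy as np

def kernel_ridge(K, y, gamma):
    """Representer-theorem solution for squared-error loss:
    f(x) = sum_i alpha_i k(x_i, x), with alpha = (K + n*gamma*I)^{-1} y.
    A new linear system must be solved for each value of gamma."""
    n = len(y)
    return np.linalg.solve(K + n * gamma * np.eye(n), y)

rng = np.random.default_rng(7)
X = rng.uniform(-1.0, 1.0, size=(150, 1))
truth = np.sin(3.0 * X[:, 0])
y = truth + 0.1 * rng.normal(size=150)
K = np.exp(-((X - X.T) ** 2) / 0.5)       # Gaussian kernel matrix
alpha = kernel_ridge(K, y, gamma=1e-3)
fitted = K @ alpha                        # fitted values at the sample points
```

The α_i have no direct interpretation in terms of the data geometry, which is part of the "black box feel" discussed above.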

Manifold Regularization and Semi-Supervised Learning
Our spectral series method is closely related to Laplacian-based regularization: In [6], Belkin et al. extend the kernel-based regularization framework to incorporate additional information about the geometric structure of the marginal P_X. Their idea is to add a data-dependent penalty term to Eq. 19 that controls the complexity as measured by the geometry of the distribution. Suppose that one is given labeled data (X_1, Y_1), ..., (X_n, Y_n) ∼ P_{X,Y} as well as unlabeled data X_{n+1}, ..., X_{n+m} ∼ P_X, where in general m ≫ n. (The limit m → ∞ corresponds to having full knowledge of P_X.) Under the assumption that the support of P_X is a compact submanifold of R^d, the authors propose minimizing a graph-Laplacian regularized least squares function

min_{f∈H_k} (1/n) Σ_{i=1}^n (Y_i − f(X_i))² + γ_A ‖f‖²_{H_k} + γ_I Σ_{i,j} (f(X_i) − f(X_j))² W_{i,j},    (21)

where W_{i,j} are the edge weights in the graph, and the last Laplacian penalty term favors functions f for which f(X_i) is close to f(X_j) when X_i and X_j are connected with large weights. Note that the eigenbasis of our row-stochastic matrix A_ε minimizes the distortion Σ_{i,j} (f(x_i) − f(x_j))² W_{i,j} in Eq. 21 if you regard the entries of A_ε as the weights W_{i,j} [3]. Indeed, the eigenvector ψ_{ε,j} minimizes this term subject to being orthogonal to ψ_{ε,i} for i < j. Hence, including a Laplacian penalty term is comparable to truncating the eigenbasis expansion in spectral series. Moreover, the semi-supervised regularization in Eq. 21 is similar to a semi-supervised version of our spectral series approach, where we first use both labeled and unlabeled data and a kernel with bandwidth ε_{m+n} to compute the eigenbasis, and then extend the eigenfunctions according to Eq. 11 via a (potentially wider) kernel with bandwidth h_n. The main downside of the Laplacian-based framework above is that it is hard to analyze theoretically. As with other kernel-based regularizers, the method also does not explicitly exploit eigenvectors and orthogonal bases.
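The Laplacian penalty and the distortion it measures are tied together by a standard identity: with L = D − W, where D is the diagonal matrix of row sums of W, one has f^T L f = (1/2) Σ_{i,j} W_{i,j} (f_i − f_j)². A quick numerical check of this identity (the matrices are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 40
S = rng.uniform(size=(n, n))
W = (S + S.T) / 2.0                  # symmetric edge weights
L = np.diag(W.sum(axis=1)) - W       # unnormalized graph Laplacian
f = rng.normal(size=n)
quad = f @ L @ f                                         # penalty form
distortion = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)  # distortion form
```

Functions that vary slowly across heavily weighted edges thus receive a small penalty, which is exactly the smoothness notion the eigenbasis orders by.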

Choice of Kernel
In the RKHS literature, there is a long list of commonly used kernels. These include, e.g., the Gaussian kernel k(x, y) = exp(−‖x − y‖²/(4ε)), polynomial kernels k(x, y) = (⟨x, y⟩ + 1)^q [52], and the thin-plate spline kernel k(x, y) = ‖x − y‖² log(‖x − y‖²) [20]. In our numerical experiments (Sec. 6), we will consider both Gaussian and polynomial kernels, but throughout the rest of the paper, we will primarily work with the (row-normalized) Gaussian kernel. There are several reasons for this choice: (i) The Gaussian kernel can be interpreted as the heat kernel in a manifold setting [3,21]. We will take advantage of this connection in the theoretical analysis of the spectral series estimator (Sec. 5). (ii) The eigenfunctions of the Gaussian kernel are simultaneously concentrated in time (i.e., space) and frequency, and are particularly well-suited for estimating functions that are smooth with respect to a low-dimensional data distribution.
The following two examples illustrate some of the differences between the eigenbases of Gaussian and polynomial kernels:

Example 3. Suppose that P is a uniform distribution U(−1, 1) on the real line. Fig. 2, left, shows the eigenfunctions of a third-order polynomial kernel k(x, y) = (⟨x, y⟩ + 1)³. These functions are smooth but have large values outside the support of P. Contrast this eigenbasis with the eigenfunctions, in Fig. 2, right, of a Gaussian kernel. The latter functions are concentrated on the support of P and are orthogonal on (−1, 1) as well as on (−∞, ∞).

Example 4. Consider data around a noisy spiral, parameterized by a uniform random variable u with normally distributed noise in the x- and y-coordinates. The eigenfunctions of a polynomial kernel do not adapt well to the underlying distribution of the data. Fig. 3, left, for example, is a contour plot of the Nyström extension of the fourth empirical eigenvector of a third-order polynomial kernel. In contrast, the eigenfunctions of a Gaussian diffusion kernel vary smoothly along the spiral direction, forming a Fourier-like basis with orthogonal eigenfunctions that concentrate around high-density regions; see Fig. 3, right.

In high dimensions, Gaussian extensions can be seen as a generalization of prolate spheroidal wave functions [12]. Prolates were originally introduced by Slepian and Pollack as the solution to the problem of simultaneously and optimally concentrating a function and its Fourier content (see [48] for a fascinating account of this development in Fourier analysis and modeling). The band-limited functions that maximize their energy content within a space domain X ⊂ R^n are extensions of the eigenfunctions of the integral operator of a Bessel kernel restricted to X [12, Section 3.1]. In high dimensions, Bessel and Gaussian kernels are equivalent [43], suggesting that the eigenfunctions of the Gaussian kernel are nearly optimal.
However, although Gaussian kernels have many advantages, they may not always be the best choice in practice. Ultimately, the choice is determined by the application and by what the best measure of similarity between two data points is. Our framework suggests a principled way of selecting the best kernel for regression: among a set of reasonable candidate kernels, choose the estimator with the smallest empirical loss according to Eq. 14. We use this approach in Sec. 6, for example, to choose the optimal degree q for a set of polynomial kernels of the form k(x, y) = (⟨x, y⟩ + 1)^q.
Normalization of Local Kernels. In the RKHS literature, it is standard to work with "unnormalized" kernels. In spectral clustering [53], on the other hand, researchers often use the "stochastic" and "symmetric" normalization schemes in Eq. 5 and Eq. 7, respectively. We have found (Sec. 6) that the exact normalization often has little effect on the performance in regression. Nevertheless, we choose the row-stochastic kernel for reasons of interpretation and analysis: First, the limit of the bandwidth ε → 0 is well-defined, and there is a series of works on the convergence of the graph Laplacian to the Laplace-Beltrami operator on Riemannian manifolds [11, 5, 24, 47, 19]. Fourier functions originate from solving a Laplace eigenvalue problem on a bounded domain; hence, the eigenfunctions of the diffusion operator can be seen as a generalization of Fourier series to manifolds.
Moreover, the row-stochastic kernel yields less variable empirical eigenfunctions than the unnormalized or symmetric forms. As an illustration, consider the noisy spiral data in Example 4. Fig. 4 shows the estimated projections onto the spiral direction of the eigenfunctions of the symmetric and the stochastic forms; see the left and right plots, respectively. The eigenfunctions are clearly smoother in the latter case. By construction, the empirical eigenfunctions of the symmetric operator are orthogonal with respect to the empirical distribution P_n, whereas the estimated eigenfunctions of the stochastic operator are orthogonal with respect to the smoothed data distribution S_ε. The kernel bandwidth ε_n defines the scale of the analysis.
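The relationship between the two normalization schemes can be checked numerically. The following sketch (our own construction, not code from the paper) verifies that the symmetric and row-stochastic forms share the same spectrum, while their eigenvectors differ by a factor of √d:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (4.0 * 0.5))   # Gaussian kernel with eps = 0.5
d = K.sum(axis=1)

A_sym = K / np.sqrt(np.outer(d, d))   # "symmetric" normalization
A_row = K / d[:, None]                # "stochastic" (row) normalization

# The two matrices are similar, hence share the same (real) spectrum.
vals_sym = np.sort(np.linalg.eigvalsh(A_sym))
vals_row = np.sort(np.linalg.eigvals(A_row).real)
assert np.allclose(vals_sym, vals_row, atol=1e-8)

# If v is an eigenvector of A_sym, then v / sqrt(d) is one of A_row.
lam = np.linalg.eigvalsh(A_sym)[-1]
v = np.linalg.eigh(A_sym)[1][:, -1]
u = v / np.sqrt(d)
assert np.allclose(A_row @ u, lam * u, atol=1e-8)
```

In other words, the normalizations agree on the eigenvalues; what differs is the weighting of the eigenvectors, which is what drives the smoothness difference seen in Fig. 4.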

Theory
In this section, we derive theoretical bounds on the loss (Eq. 13) of a series regression estimator with a radial kernel, both for a standard fixed RKHS setting (Theorem 13) and for a setting where the kernel bandwidth ε_n varies with the sample size n (Theorem 14). We also further elaborate on the connection between spectral series and Fourier analysis by generalizing the well-known link between Sobolev differentiability of signals and the approximation error in a Fourier basis.
Using the same notation as before, let f_{ε,J}(x) = Σ_{j≤J} β_{ε,j} ψ_{ε,j}(x), where β_{ε,j} = ∫_X f(x) ψ_{ε,j}(x) dS_ε(x). We decompose the error of the estimator into the two terms f_{ε,J} − f and f̂ − f_{ε,J}, and refer to them as "bias" and "variance". Hence, we define the integrated bias L_bias = ‖f_{ε,J} − f‖² and the integrated variance L_var = E‖f̂ − f_{ε,J}‖², and bound the two components separately. Our assumptions are:

(A1) P has compact support X and bounded density 0 < a ≤ p(x) ≤ b < ∞, ∀x ∈ X.
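The resulting bound on the loss follows from the elementary inequality ‖a + b‖² ≤ 2‖a‖² + 2‖b‖²; schematically (a sketch consistent with the definitions above, not the paper's exact display):

```latex
\|\widehat{f} - f\|^2
  \;\le\; 2\,\underbrace{\|f_{\varepsilon,J} - f\|^2}_{\text{integrated bias}}
  \;+\; 2\,\underbrace{\|\widehat{f} - f_{\varepsilon,J}\|^2}_{\text{variance term}},
\qquad
f_{\varepsilon,J}(x) \;=\; \sum_{j \le J} \beta_{\varepsilon,j}\,\psi_{\varepsilon,j}(x).
```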
(A2) The weights are positive and bounded; that is, ∀x, y ∈ X, 0 < m ≤ k_ε(x, y) ≤ M < ∞, where m and M are constants that do not depend on ε.
Without loss of generality, we assume that the eigenfunctions ψ_{ε,j} are estimated using an unlabeled sample X_1, …, X_n that is drawn independently of the data used to estimate the coefficients β_{ε,j}. This assumption simplifies the proofs and can always be satisfied by splitting the data into two sets.

Bias
A key point is that the approximation error of the regression depends on the smoothness of f relative to P. Here we present two different calculations of the bias, based on two related notions of smoothness. The first notion is standard in the kernel literature and is based on RKHS norms. The second notion is based on our diffusion framework and can be seen as a generalization of Sobolev differentiability.
Method 1: Smoothness measured by RKHS norm. Consider the kernel k_ε(x, y), where ε is a strictly positive number. Under the previous assumptions, this kernel is symmetric and psd with a unique RKHS, which we denote by H_ε. A standard way to measure the smoothness of a function f in an RKHS H_ε is through the RKHS norm ‖f‖_{H_ε} (see, e.g., [33]). One can then define function classes of the form {f : ‖f‖_{H_ε} ≤ M}, where M is a positive number dependent on ε.
Method 2: Smoothness measured by diffusion operator.
Alternatively, let G_ε = (A_ε − I)/ε, where I is the identity. The operator G_ε has the same eigenvectors ψ_{ε,j} as the diffusion operator A_ε. Its eigenvalues are given by −ν²_{ε,j} = (λ_{ε,j} − 1)/ε, where λ_{ε,j} are the eigenvalues of A_ε. Define the functional J_ε(f) = −⟨G_ε f, f⟩, which maps a function f ∈ L²(X, P) into a non-negative real number. For small ε, J_ε(f) measures the variability of the function f with respect to the distribution P. The expression is a variation of the graph Laplacian regularizers popular in semi-supervised learning [59]. A Taylor expansion, however, produces an extra design-bias term ∇p/p · ∇f. In classical regression, this bias is removed by using local linear smoothing [17], which is asymptotically equivalent to replacing the original kernel k_ε(x, y) by the bias-corrected kernel k*_ε(x, y) = k_ε(x, y)/(p_ε(x) p_ε(y)) [11]. The following result bounds the approximation error of an orthogonal series expansion of f. The bound is consistent with Theorem 2 in [58], which applies to the more restrictive setting of SSL with infinite unlabeled data and ε → 0. Our result holds for all ε and J and does not assume unlimited data.
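An empirical version of J_ε(f), i.e., the graph-Laplacian quadratic form on a sample, can be sketched as follows. This is our own minimal construction: the scaling conventions are assumptions for illustration, not the paper's exact definitions.

```python
import numpy as np

def empirical_smoothness(X, f_vals, eps):
    """Graph-Laplacian quadratic form f'(I - A)f / (eps * n), where A is the
    row-stochastic Gaussian kernel matrix; small values indicate a function
    that varies slowly with respect to the sampled data distribution."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (4.0 * eps))
    A = K / K.sum(axis=1, keepdims=True)
    n = len(X)
    return f_vals @ (f_vals - A @ f_vals) / (eps * n)

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(300, 1))
eps = 0.01
smooth = np.cos(np.pi * X[:, 0])        # slowly varying over the data
rough = np.cos(8 * np.pi * X[:, 0])     # rapidly varying over the data

# constants are exactly "flat": the row-stochastic A maps them to themselves
assert abs(empirical_smoothness(X, np.ones(300), eps)) < 1e-10
# the rapidly oscillating function pays a much larger roughness penalty
assert empirical_smoothness(X, rough, eps) > empirical_smoothness(X, smooth, eps)
```

Note that because A is row-stochastic, constant functions incur exactly zero penalty, which is the discrete analogue of J_ε vanishing on constants.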

Smoothness and Sparsity
In the limit ε → 0, we obtain several interesting results, including a generalization, to a setting with adaptive bases and high-dimensional data, of the classical connection between Sobolev differentiability and the error decay of Fourier approximations [31, Section 9.1.2]. We denote the quantities derived from the bias-corrected kernel k*_ε by A*_ε, G*_ε, J*_ε, and so forth.

Definition 7. (Smoothness relative to P.) Here S(A) = ∫_A p(x) dP(x) / ∫_X p(x) dP(x) is the stationary distribution of the random walk on the data as ε → 0. The smaller the value of c, the smoother the function.

Lemma 8. For functions f ∈ C³(X) whose gradients vanish at the boundary, J*_ε(f) converges to J(f) = ∫_X ‖∇f(x)‖² dS(x) as ε → 0. This is similar to the convergence of the (un-normalized) graph Laplacian regularizer to the density-dependent smoothness functional ∫_X ‖∇f(x)‖² p²(x) dx [8].
Next we show that smoothness relative to P (Definition 7) and sparsity, with respect to the L² norm, in the eigenbasis of the diffusion operator (Definition 9 below) are really the same thing. As a result, we can link smoothness and sparsity to the rate of the error decay of the eigenbasis approximation.

Theorem 10. Assume that B = {ψ_1, ψ_2, …} are the eigenvectors of the limiting diffusion operator, with eigenvalues ν²_j = O(j^{2s}) for some s > 1/2. Let f_J(x) = Σ_{j≤J} β_j ψ_j(x). Then smoothness relative to P and sparsity in B are equivalent, and either property implies a bound on the approximation error of f_J. The rate s of the error decay depends on the dimension of the data; we address this issue in Sec. 5.3.

Variance
The matrix Â_ε (defined in Eq. 5) can be viewed as a perturbation, due to finite sampling, of the integral operator A_ε. To estimate the variance, we bound the difference ψ̂_{ε,j} − ψ_{ε,j}, where ψ_{ε,j} are the eigenvectors of A_ε, and ψ̂_{ε,j} are the Nyström extensions (Eq. 11) of the eigenvectors of Â_ε. We adopt a strategy from Rosasco et al. [40], which is to introduce two new integral operators that are related to A_ε and Â_ε but act on an auxiliary RKHS H of smooth functions (see Appendix A.2 for details). As before, we write ε_n to indicate that we let the kernel bandwidth ε depend on the sample size n.

Fixed Kernel
In kernel machine learning, it is standard to assume a fixed RKHS H_k, with norm ‖·‖_{H_k} and a fixed kernel k whose bandwidth ε does not depend on n. From Propositions 5 and 12 and under assumptions (A1)-(A4), we then have the following result (Theorem 13). The problem is that H_k, M, and the eigenvalues λ_{ε,j} all depend on ε. This dependence is complicated and poorly understood. Hence, in what follows, we will use, instead of the RKHS norm, an alternative measure of smoothness based on the diffusion operator (Method 2 in Sec. 5.1). This simplifies the theory and allows us to analyze the dependence of the series estimator on tuning parameters and sparse structure.
Corollary 15. Assume that f ∈ C³_b(X) and that the kernel k = k*_ε is corrected for bias. Then, for ε_n → 0 and nε_n → ∞, the loss satisfies a bound of the form in Eqs. 25-27, where ν²_{J+1} is the (J + 1)th eigenvalue of the limiting diffusion operator, and J(f) = ∫_X ‖∇f(x)‖² dS(x).

Some comments on these results: The first term in Eqs. 25-27 corresponds to the approximation error of the estimator and decays with J. The second and third terms correspond to the variance. Note that the variance term J · O_P(1/n) is the same as the variance of a traditional orthogonal series estimator in one dimension only; in d dimensions, the variance term for a traditional tensor-product basis is O_P((1/n) ∏_{i=1}^d J_i), where J_i is the number of components in the ith direction [16]. Hence, there is a considerable gain in using an adaptive basis, but we incur an additional variance term J · O_P(γ²_n/(ε_n Δ²_J)) from estimating the basis. If we balance the two ε-terms in Eq. 27, we get a bandwidth of ε_n ≍ (1/n)^{2/(d+4)}. With this choice of ε_n, and by ignoring terms of lower order, we obtain an explicit rate. Finally, if we apply the results in [11, 18, 19, 40] to general Riemannian manifolds (see, for example, [25, 36] for kernel density estimation on manifolds), and use that the eigenvalues of the Laplace-Beltrami operator on an r-dimensional Riemannian manifold scale as ν²_j ∼ j^{2/r} [42], we obtain the following corollary:

Corollary 16. Suppose the support of the data is a compact C^∞ submanifold of R^d with intrinsic dimension r, and suppose that f is smooth relative to P (Definition 7). Under the assumptions of Theorem 14 and Corollary 15, we obtain a convergence rate that is governed by the intrinsic dimension r rather than the ambient dimension d.
It is then optimal to take J of the order of a power of n/log n that depends on r.
We make the following observations for a spectral series estimator with flexible kernel bandwidth:

(i) Adaptivity. The rate above is a significant improvement over the minimax rate n^{−1/O(d)} for a nonparametric regressor in R^d. Our estimator automatically adapts to sparse structure and does not require knowledge of r, or an estimate of r, in practice. The optimal error rate is achieved when the smoothing parameters J and ε are properly selected for the given r; in practice, the amount of smoothing is chosen by cross-validation, as in Sec. 2.5.
(ii) Minimax Optimality. In a semi-supervised learning setting, the estimation error of the basis vanishes in the limit of infinite unlabeled data. The loss then reduces to O(J^{−2/r}) + J · O_P(1/n), which is minimized by taking J ≍ n^{r/(r+2)}. At the minimum, we achieve the rate n^{−2/(r+2)}, the minimax rate for a nonparametric estimator of Sobolev smoothness β = 1 in R^D, where D = r. The latter result is also, up to a logarithmic term, in agreement with [58].

Numerical Examples
Finally, we use data with complex dependencies to compare the spectral series approach with classical kernel smoothing, k-nearest neighbors (kNN) regression, regularization in RKHS, and recent state-of-the-art manifold and local regression methods.
In our experiments, we split the data into three sets for training, validation, and testing, respectively. For the manifold regression estimators from Aswani et al. [2] and Cheng et al. [10], we use the authors' code with built-in cross-validation. For all other estimators, we tune parameters according to Sec. 2.5. To assess the final models, we compute the estimated loss and its standard error on the test data.
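To make the workflow concrete, here is a minimal sketch in Python of a spectral series estimator with the number of terms J chosen on a validation set. The function name, bandwidth, and simulated data are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def spectral_series_fit(Xtr, ytr, Xval, yval, eps, J_max):
    """Spectral series regression sketch: eigendecompose the normalized
    Gaussian kernel on the training data, expand y in the leading
    eigenvectors, extend them to new points by the Nystrom formula,
    and choose the number of terms J on the validation set."""
    n = len(Xtr)
    sq = np.sum((Xtr[:, None, :] - Xtr[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (4.0 * eps))
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))            # symmetric normalization
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1][:J_max]
    vals, vecs = vals[order], vecs[:, order]
    psi = np.sqrt(n) * vecs                    # orthonormal in the empirical norm
    beta = psi.T @ ytr / n                     # estimated series coefficients

    # Nystrom extension of the eigenvectors to the validation points
    sq_v = np.sum((Xval[:, None, :] - Xtr[None, :, :]) ** 2, axis=-1)
    Kv = np.exp(-sq_v / (4.0 * eps))
    Sv = Kv / np.sqrt(np.outer(Kv.sum(axis=1), d))
    psi_val = np.sqrt(n) * (Sv @ vecs) / vals

    losses = [np.mean((yval - psi_val[:, :J] @ beta[:J]) ** 2)
              for J in range(1, J_max + 1)]
    J_best = 1 + int(np.argmin(losses))
    return beta[:J_best], J_best, min(losses)

rng = np.random.default_rng(3)
Xtr = rng.uniform(0.0, 1.0, size=(300, 1))
ytr = np.sin(2 * np.pi * Xtr[:, 0]) + rng.normal(0, 0.1, 300)
Xval = rng.uniform(0.0, 1.0, size=(150, 1))
yval = np.sin(2 * np.pi * Xval[:, 0]) + rng.normal(0, 0.1, 150)

beta, J_best, val_loss = spectral_series_fit(Xtr, ytr, Xval, yval,
                                             eps=0.01, J_max=10)
```

The loop over J reuses the same coefficients β̂_j, which is why tuning the series estimator is cheap compared to refitting a ridge regression once per regularization level.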

Estimating Pose Using Images of Faces
In our first example, we consider images of artificial faces from the Isomap database [50]. There are a total of n = 698 gray-scale images of size 64 × 64, rendered with different orientations and lighting directions. Fig. 1 shows a visualization of these data, where we use the first two non-trivial eigenvectors of the Gaussian diffusion kernel as coordinates (i.e., Eq. 4 with the approximate eigenvectors from Eq. 11).
Our goal is to estimate the horizontal left-right pose of each face. We compare several different approaches to regression: (i) As a baseline, we choose the classical Nadaraya-Watson estimator with a Gaussian smoothing kernel (NW) and the k-nearest neighbors regression estimator (kNN). The latter estimator is known to be minimax optimal with respect to the local intrinsic dimension [27].
(ii) For the spectral series method (Series), we implement the Gaussian kernel (Series-radial) and polynomial kernels k(x, y) = (⟨x, y⟩ + 1)^q of different degrees q. We treat q as a tuning parameter and denote the polynomial kernel with the smallest estimated loss by Series-polyBest. Note that choosing q = 1 (Series-poly1) is equivalent to a linear regression on eigenvectors computed with PCA.
(iii) We also implement the RKHS method in Sec. 3.2 for the same set of kernels as Series. For a squared-error loss, Eq. 19 reduces to an infinite-dimensional, generalized ridge regression problem [23, Section 5.8.2]. Hence, we use the term kernel ridge regression (KRR) and denote the estimators by KRR-radial and KRR-poly.
(iv) The last group of estimators includes recent manifold and local regression methods [2, 10]: locOLS is a local ordinary least squares, locRR is a local ridge regression, locEN is a local elastic net, locPLS is a local partial least squares, locPCR is a local principal components regression, NEDE is the nonparametric exterior derivative estimator, NALEDE is the nonparametric adaptive lasso exterior derivative estimator, and NEDEP and NALEDEP are the corresponding estimators for the "large p, small n" case. The last four estimators (NEDE, NALEDE, NEDEP, NALEDEP) pose the regression as a least-squares problem with a term that penalizes the regression vector for lying in directions perpendicular to an estimated manifold; see [2] for details. In our comparison, we also include MALLER [10], which first estimates the local dimension of the data and then performs local linear regression on an estimated tangent plane.
Manifold and local regression methods, unlike Series, quickly become computationally intractable in high dimensions. Hence, to be able to compare the different methods, we follow Aswani et al. [2] and rescale the Isomap images from 64 × 64 down to 7 × 7 pixels. This reduces the number of covariates from d = 4096 to d = 49. In other words, we regress the left-right pose Y (our response) on the rescaled image X ∈ R^49 (our predictor). We use 50% of the data for training, 25% for validation, and 25% for testing. All covariates are normalized to have mean 0 and standard deviation 1.
Table 1 and Fig. 5 summarize the results of the final (cross-validated) estimators. The approaches with the best performance are Series-radial and KRR-radial. As expected, Series and KRR estimators yield similar losses. A first-order polynomial kernel, i.e., a global principal component regression with Series- or KRR-poly1, performs worse than NW and kNN. Higher-order polynomial kernels (with degree q = 2 resulting in the smallest loss), as well as the manifold and local regression estimators (in particular, NEDE and MALLER), improve upon the NW and kNN results, but Series-radial and KRR-radial remain the best choices in terms of statistical and computational performance.

Estimating Redshift Using SDSS Galaxy Spectra
In the following (high-dimensional) example, we predict the redshift of galaxies from high-resolution measurements of their emission spectra. Our initial data sample consists of galaxy spectra from ten arbitrarily chosen spectroscopic plates of SDSS DR6. We preprocess and remove spectra according to the three cuts described in [38]. The final sample consists of n = 2812 high-resolution spectra with flux measurements at d = 3501 wavelengths. We renormalize each spectrum so that it has unit norm. Our goal is to predict a galaxy's redshift Y, where the predictor is an entire spectrum x ∈ R^3501. Fig. 6a shows an example of an SDSS spectrum. Fig. 6b shows a low-dimensional visualization of the full data set using the first few vectors of the diffusion basis as coordinates.
Each point in the plot represents a galaxy, and the color codes the SDSS spectroscopic redshift. The redshift (the response Y) appears to vary smoothly with the eigencoordinates. For the regression, we use 50% of the data for training, 25% for validation, and 25% for testing. Due to the high dimension of the predictor (d = 3501), we are unable to implement the computationally intensive manifold and local regression estimators from [2]. Table 2 and Fig. 7 summarize the results for the other approaches to regression. Series and KRR are essentially equivalent in terms of performance, and as before, the radial kernel (Series-radial and KRR-radial) yields the smallest estimated loss. For these data, a linear dimensionality reduction with PCA (Series-poly1) improves upon the NW and kNN regression results. MALLER and higher-order polynomials (with degrees 5 and 6) perform better than PCA, but Series-radial still has the smallest estimated loss. Moreover, MALLER is much slower than Series: the former estimator takes 34 minutes on a 2.70GHz Intel Core i7-4800MQ, whereas Series with cross-validation takes less than a minute.

Increasing Dimension
In terms of computational speed, the spectral series estimator has a clear competitive edge in high dimensions relative to local regression procedures and to a least-squares (LS) implementation of Eq. 19 that does not take advantage of orthogonal bases (see, e.g., [4, p. 215] for an LS implementation of SSL learning on manifolds). We illustrate the differences with a one-dimensional manifold embedded in d dimensions. Let Y|x ∼ N(θ(x), 0.5), where the points x = (x_1, …, x_d) lie on a unit circle in R^d, and θ(x) is the angle corresponding to the position of x. For simplicity, we simulate data uniformly on the circle; i.e., we let θ(x) ∼ Unif(0, 2π).
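The simulation can be set up as follows. This sketch embeds the circle in the first two coordinates and pads with zeros, and reads the 0.5 in N(θ(x), 0.5) as a standard deviation; both choices are our assumptions for illustration:

```python
import numpy as np

def make_circle_data(n, d, seed=0):
    """Simulate the circle example: theta ~ Unif(0, 2*pi), the predictor
    x = (cos theta, sin theta, 0, ..., 0) lies on the unit circle in R^d,
    and Y | x ~ N(theta(x), 0.5^2). The intrinsic dimension is 1 for any d."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    X = np.zeros((n, d))
    X[:, 0], X[:, 1] = np.cos(theta), np.sin(theta)
    Y = theta + rng.normal(0.0, 0.5, n)
    return X, Y, theta

X, Y, theta = make_circle_data(n=2000, d=100)
```

Increasing d adds ambient coordinates but leaves the geometry, and hence the difficulty of the regression, unchanged; this is exactly the regime explored in Fig. 8.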
Figure 8 summarizes the results. In terms of estimated loss (left panel), Series performs better than MALLER, and it has a statistical performance similar to the least-squares implementation of kernel ridge regression (KRR-LS). As predicted by the theory, the loss of Series does not depend on the ambient dimension d. Moreover, the computational time of Series is nearly constant as a function of the dimension d (right panel). KRR-LS is slower than Series, and MALLER becomes computationally intractable as d increases. For d = 2500 and n = 2000, each fit with MALLER takes an average of 354 seconds (6 minutes) on an Intel i7-4800MQ CPU 2.70GHz processor, compared to 72 seconds for Series.


Increasing Sample Size
Here we revisit the redshift prediction problem in Sec. 6.2 using galaxy spectra from SDSS DR12. We increase the size of the training set for a fixed number of 1000 validation spectra and 1000 test spectra. Fig. 9 summarizes the results.

Estimated loss (left) and computational time (right) as a function of the sample size for different regression estimators. Note that Randomized SVD (Series RSVD) dramatically reduces the computational time for large sample sizes (right) with little impact on the statistical performance (left).

Discussion
Our spectral series method can handle complex high-dimensional data objects in many settings where traditional nonparametric methods either perform poorly or are computationally intractable. The series method offers a compression of the data in terms of Fourier coefficients; it is computationally efficient (with regard to both the dimension and the size of the sample), and it returns orthogonal basis functions that adapt to low-dimensional structure in the data distribution. As a result, there is no need for cumbersome tensor products in high dimensions.
Our work shows that, for a Gaussian kernel with a flexible bandwidth, the computed eigenfunctions form a Fourier-like orthogonal basis for expressing smoothness relative to the underlying data distribution. More precisely, if a function is smooth with respect to the data distribution, then it is sparse in the eigenbasis with respect to the L² norm, and vice versa (Theorem 10). Indeed, in the limit of the sample size n → ∞, spectral series with a Gaussian kernel can be seen as a generalization of Fourier series to high dimensions and sparse structure (Sec. 4).
The two main theorems, 13 and 14, provide theoretical bounds on the loss of the final regression estimator for a standard fixed RKHS setting, as well as for a setting where the kernel bandwidth varies with the sample size n. We show that spectral series regression with a Gaussian kernel is adaptive to intrinsic dimension when the bandwidth ε_n → 0 (Corollary 16). In the case of a submanifold of dimension r embedded in R^d, the convergence rate of the estimator depends on the manifold dimension r rather than the ambient dimension d. The adaptation occurs automatically and does not involve manifold estimation. Unlike [7], there is also no need to estimate the dimension of the manifold. We have found that, unless the goal is manifold estimation, there is little advantage in using manifold and local linear regression methods. Such methods quickly become computationally intractable in high dimensions without a prior dimension reduction. The computational speed of spectral series, on the other hand, does not depend on the ambient or intrinsic dimension of the data. Moreover, it is unclear how manifold-based methods behave in more complex settings where there is sparse structure (e.g., high-density regions and clusters) but no well-defined submanifold.
Because of the close connection between spectral series and SVMs, we expect that our new findings (regarding adaptivity and the choice of kernel and bandwidth) will apply to kernel-based regularized empirical risk minimizers as well. Indeed, our empirical results (Tables 1 and 2) confirm that the performance of KRR using a Gaussian kernel with a flexible bandwidth is similar to that of spectral series regression. This suggests that one can exploit the advantages of spectral series in terms of interpretation, visualization, and analysis without any real downsides. In the process of analyzing the performance of the spectral series estimator, we shed light on the empirical success of SVMs for sparse data, and we unify ideas from Fourier analysis, kernel machine learning, and spectral clustering.
Future work includes deriving tighter bounds on the convergence rate of spectral series and kernel-based empirical regularizers. We believe that our estimated rates are on the conservative side, as our derivations assume that the eigenvectors need to be accurately estimated. Empirical experiments, however, indicate that spectral series with approximate eigenvectors already outperform the k-nearest neighbors estimator, which is minimax optimal with respect to local intrinsic dimension [27]. In a separate paper, we will discuss extensions of spectral series to estimating other unknown functions (e.g., conditional densities, density ratios, and likelihoods) for high-dimensional complex data and distributions. Another interesting research question is whether one can further improve the performance of spectral series approaches by adaptive basis selection and nonlinear estimators that threshold the series expansion coefficients |β̂_j|, as in wavelet thresholding [31].
In the online supplementary materials, we include sample R code for the spectral series estimator. This code has, however, not been optimized for speed, as we leave large-scale deployment on parallel platforms to future work.

Lemma 20. Under the same assumptions as in Proposition 11, the stated bound holds. To see this, apply Proposition 11 to obtain the bound up to a constant C; then apply Chebyshev's inequality for arbitrary M > 0, followed by the Cauchy-Schwarz inequality. Under assumption (A4), we conclude the result of the lemma.

A.2. Variance
Let H be an auxiliary RKHS of smooth functions; we use the term "auxiliary" to indicate that the space only enters the intermediate derivations and plays no role in the error analysis of the algorithm itself. We define the two integral operators A_H, Â_H : H → H by A_H f(x) = ∫ a_ε(x, y) ⟨f, K(·, y)⟩_H dP(y) and Â_H f(x) = ∫ a_ε(x, y) ⟨f, K(·, y)⟩_H dP̂_n(y), where a_ε(x, y) = k_ε(x, y) / ∫ k_ε(x, y) dP(y), and K is the reproducing kernel of H. Define the operator norm ‖A‖_H = sup_{f∈H} ‖Af‖_H / ‖f‖_H, where ‖f‖²_H = ⟨f, f⟩_H. Now suppose the weight function k_ε is sufficiently smooth with respect to H (Assumption 1 in [40]); this condition is, for example, satisfied by a Gaussian kernel on a compact support X. By Propositions 13.3 and 14.3 in [40], we can then relate the functions ψ_{ε,j} and ψ̂_{ε,j}, respectively, to the eigenfunctions u_{ε,j} and û_{ε,j} of A_H and Â_H. We have that

‖ψ̂_{ε,j} − ψ_{ε,j}‖_{L²(X,P)} = C_1 ‖û_{ε,j} − u_{ε,j}‖_{L²(X,P)} ≤ C_2 ‖û_{ε,j} − u_{ε,j}‖_H   (30)

for some constants C_1 and C_2. According to Theorem 6 in [39] for eigenprojections of positive compact operators, ‖û_{ε,j} − u_{ε,j}‖_H is controlled by ‖Â_H − A_H‖_H/δ_{ε,j}, where δ_{ε,j} is proportional to the eigengap λ_{ε,j} − λ_{ε,j+1}. As a result, we can bound the difference ‖ψ̂_{ε,j} − ψ_{ε,j}‖_{L²(X,P)} by controlling the deviation ‖Â_H − A_H‖_H. We choose the auxiliary RKHS H to be a Sobolev space with a sufficiently high degree of smoothness (see below for details). Let H^s denote the Sobolev space of order s with vanishing gradients at the boundary, where D^α f is the weak partial derivative of f with respect to the multi-index α, and L²(X) is the space of square-integrable functions with respect to the Lebesgue measure. Let C³_b(X) be the set of uniformly bounded, three-times differentiable functions with uniformly bounded derivatives whose gradients vanish at the boundary. Now consider H ⊂ H^s, and choose s large enough so that D^α f ∈ C³_b(X) for all f ∈ H and |α| = s. Under assumptions (A1)-(A4), we derive the following result, where A_ε f(x) = ∫ a_ε(x, y) f(y) dP(y). From [18] and a bound on the difference Â_ε f(x) − A_ε f(x), we obtain sup_{f∈C³_b(X)} ‖Â_ε f − A_ε f‖_∞ = O_P(γ_n). The Sobolev space H is a Hilbert space with respect to its scalar product, and we bound the contribution to L_var from each of the two terms separately, using Cauchy's inequality and Proposition 11.
By construction, it holds that (1/n) Σ_i ψ̂_{ε,j}(X̃_i) ψ̂_{ε,ℓ}(X̃_i) ŝ_ε(X̃_i) = δ_{j,ℓ} for a sample X̃_1, …, X̃_n drawn independently of X_1, …, X_n. Finally, from the orthogonality property of the ψ_{ε,j}'s, together with Lemmas 23 and 18, the result follows.

Fig 1: Embedding, or so-called "eigenmap," of the Isomap face data using the first two non-trivial eigenvectors of the Gaussian diffusion kernel. The eigenvectors capture the pose Y and other continuous variations of an image x fairly well, and the regression E(Y|x) appears to vary smoothly in sample space.

Fig 2: Uniform distribution U(−1, 1). Eigenfunctions of a third-order polynomial kernel (left) and of an (un-normalized) Gaussian kernel (right). The latter eigenfunctions form an orthogonal, Fourier-like basis concentrated on the support of the data distribution.

Fig 3: Spiral data. Contour plots of the fourth eigenvector of a third-order polynomial kernel (left) and the fourth eigenvector of a Gaussian diffusion kernel (right). The latter eigenvector is localized and varies smoothly along the spiral direction.

Fig 4: Projection onto the spiral direction (t = √u) of the eigenvectors of the symmetric graph Laplacian (left) and of the Gaussian diffusion kernel (right). The latter estimates are less noisy.

where ∇ is the gradient operator and Δ is the psd Laplace operator in R^d. In kernel regression smoothing, the extra term ∇p/p · ∇f is considered an undesirable extra bias, called the design bias.
Functions in W_B(s, c) are sparse in B. The larger the value of s, the sparser the representation.

Fig 5: Estimated loss of estimators for the Isomap face data; see Table 1. For visibility, we have divided the estimators into three groups: classical NW kernel and kNN smoothers (left), Series/RKHS-type estimators (center), and manifold regression estimators, such as NEDE and MALLER (right). Bars represent standard errors.

Fig 6: Left: Example of an SDSS galaxy spectrum. Right: Embedding of a sample of SDSS galaxy spectra using the first three non-trivial eigenvectors of the Gaussian diffusion kernel as coordinates; the color codes the true redshift.

Fig 7: Estimated loss of different estimators for redshift prediction; see Table 2. For visibility, we have divided the estimators into three groups: classical NW kernel and kNN smoothers (left), Series/RKHS-type estimators (center), and the manifold regression estimator MALLER (right). Bars represent standard errors.

Fig 8: Increasing the dimension (number of variables) d for a circle embedded in R^d. Estimated loss (left) and computational time (right) as a function of the dimension d for different regression estimators.

Fig 9: Increasing the size of the training set for redshift prediction using SDSS galaxy spectra.

Table 1: Estimated loss for the Isomap face data.

Table 2: Estimated loss for redshift prediction using SDSS galaxy spectra.
Fig. 9 indicates massive payoffs from implementing Randomized SVD (Series RSVD) for large data sets; see the discussion in Sec. 2.6. Even without parallelization, we are able to cut the computational time by a factor of 15 (right panel) with almost no decrease in statistical performance (left panel). The run time for SVD and KRR-LS at sample size n = 11200 (and dimension d = 3431) is about 5 hours on an Intel i7-4800MQ CPU 2.70GHz processor. With Randomized SVD, the same regression takes about 20 minutes.