Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators

Abstract: Regularization is an essential element of virtually all kernel methods for nonparametric regression problems. A critical factor in the effectiveness of a given kernel method is the type of regularization that is employed. This article compares and contrasts members from a general class of regularization techniques, which notably includes ridge regression and principal component regression. We derive an explicit finite-sample risk bound for regularization-based estimators that simultaneously accounts for (i) the structure of the ambient function space, (ii) the regularity of the true regression function, and (iii) the adaptability (or qualification) of the regularization. A simple consequence of this upper bound is that the risk of the regularization-based estimators matches the minimax rate in a variety of settings. The general bound also illustrates how some regularization techniques are more adaptable than others to favorable regularity properties that the true regression function may possess. This, in particular, demonstrates a striking difference between kernel ridge regression and kernel principal component regression. Our theoretical results are supported by numerical experiments.


Introduction
Suppose that the observed data consist of z_i = (y_i, x_i), i = 1, ..., n, where y_i ∈ Y ⊆ R and x_i ∈ X ⊆ R^d. Suppose further that z_1, ..., z_n ~ ρ are iid from some probability distribution ρ on Y × X. Let ρ(· | x) denote the conditional distribution of y_i given x_i = x ∈ X and let ρ_X denote the marginal distribution of x_i. Our goal is to use the available data to estimate the regression function of y on x,

f†(x) = ∫_Y y dρ(y | x),

which minimizes the mean-squared prediction error over ρ_X-measurable functions f : X → R. More specifically, for an estimator f̂ define the risk

R_ρ(f̂) = E‖f̂ − f†‖²_{ρ_X},   (1)

where the expectation is computed over z_1, ..., z_n, and ‖·‖_{ρ_X} denotes the norm on L²(ρ_X); we seek estimators f̂ that minimize R_ρ(f̂). This is a version of the random design nonparametric regression problem. There is a vast literature on nonparametric regression, along with a huge variety of corresponding methods (e.g., Györfi et al., 2002; Wasserman, 2006). In this paper, we focus on regularization and kernel methods for estimating f†. Most of our results apply to general regularization operators. However, our motivating examples are two well-known regularization techniques: kernel ridge regression (which we refer to as "KRR"; KRR is also known as Tikhonov regularization) and kernel principal component regression ("KPCR"; also known as spectral cut-off regularization).
Our main theorem is a new upper bound on the risk of a general class of kernel-regularization methods, which includes both KRR and KPCR (Theorem 1). The theorem substantially generalizes previously published bounds (see Section 2 for a discussion of related work) and illustrates the dependence of the risk on three important features: (i) the structure of the ambient reproducing kernel Hilbert space (RKHS), (ii) the specific regularization technique employed, and (iii) the regularity (often interpreted as smoothness) of the function to be estimated. One consequence of the theorem is that the regularization methods studied in this paper (including KRR and KPCR) achieve the minimax rate for estimating f† in a variety of settings. A second consequence is that certain regularization methods (including KPCR, but not KRR) may adapt to favorable regularity of f† to attain even faster convergence rates, while others (notably KRR) are limited in this regard due to a well-known saturation effect (Neubauer, 1997; Mathé, 2005; Bauer et al., 2007). This illustrates a striking advantage that KPCR may have over KRR in these settings.

Related work
Kernel ridge regression has been studied extensively in the literature. Indeed, bounds for KRR that are similar to our Theorem 1 have been derived by Caponnetto and De Vito (2007). Moreover, it is well-known that KRR is minimax in many of the settings considered in this paper, such as those described in Corollaries 1-3 (Caponnetto and De Vito, 2007; Zhang, 2005; Zhang et al., 2013; Lin et al., 2016). However, these cited results apply only to KRR, while the results in this paper apply to a substantially larger class of regularization operators (including, for example, KPCR).
Beyond KRR, there has also been significant research into more general regularization methods, like those considered in this paper. However, our bounds are sharper than previously published results on general regularization operators. For instance, unlike our Theorem 1, the bounds of Bauer et al. (2007) do not illustrate the dependence of the risk on the ambient Hilbert space. Thus, while our approach immediately implies that many of the regularization methods under consideration are minimax optimal, it seems difficult (if not impossible) to draw this conclusion using the approach of Bauer et al. (2007). General regularization operators are also studied by Caponnetto and Yao (2010), but their results apply to semi-supervised settings where an additional pool of unlabeled data is available.
One of the major practical implications of this paper is that KPCR may have significant advantages over KRR in some settings. This has been observed previously by other researchers; others have even noted that Tikhonov regularization (KRR) saturates, while spectral cut-off regularization (KPCR) does not (Mathé, 2005; Bauer et al., 2007; Lo Gerfo et al., 2008). Our results (Theorem 1 and Corollaries 1-3) sharpen these observations by precisely quantifying the advantages of unsaturated regularization operators in terms of adaptability and minimaxity. In other related work, Dhillon et al. (2013) have illustrated the potential advantages of KPCR over KRR in finite-dimensional problems with linear kernels; though their work is not framed in terms of saturation and general regularization operators, it relies on similar concepts.
After the initial submission of our work, we became aware of simultaneous independent work by Blanchard and Mücke (2016) that proves upper bounds on the risk for a comparable class of regularization methods (which includes KRR and KPCR), as well as minimax lower bounds that match the upper bounds; their results are specialized to RKHS's where the covariance operator has polynomially decaying eigenvalues. Compared to that work, our results require a weaker moment condition on the noise for risk bounds, apply to a much broader class of RKHS's, and also consider target functions that live in finite-dimensional subspaces (see Proposition 1). Another distinguishing feature of the present work is that it contains numerical experiments that illustrate the implications of our theoretical bounds in some practical settings (Section 6).
Other important recent work was pointed out to us by a reviewer. Optimal rates in expectation for KRR and more general regularization methods are derived by Lin et al. (2016) and Guo et al. (2016), respectively, for distributed learning problems. The bounds in the latter paper, when specialized to the nondistributed setting, are similar to those in this paper and by Blanchard and Mücke (2016). However, the results in this paper are more general (for nondistributed learning), because they do not require continuity of the kernel and have weaker conditions on the output variable.
The main engine behind the technical results in this paper is a collection of large-deviation results for Hilbert-Schmidt operators. The required machinery is developed in the Appendices. These results build on straightforward extensions of results of Tropp (2015). Our most precise results for KPCR and achieving parametric rates for estimation over finite-dimensional subspaces (Proposition 1) rely on slightly different arguments, which are based on well-known eigenvalue perturbation results that have been adapted to handle Hilbert-Schmidt operators (e.g., the Davis-Kahan sin Θ theorem (Davis and Kahan, 1970)).

Statistical setting and assumptions
Our basic assumption on the distribution of z = (y, x) ~ ρ is that the residual variance is bounded; more specifically, we assume that there exists a constant σ² > 0 such that

∫_Y (y − f†(x))² dρ(y | x) ≤ σ²   (2)

for almost all x ∈ X. Zhang et al. (2013) also assume (2); this assumption is slightly weaker than the analogous assumption used by Bauer et al. (2007) (equation (1) in their paper). Note that (2) holds if y is bounded almost surely, or if y = f†(x) + ε, where ε is independent of x, E(ε) = 0, and Var(ε) = σ². Let K : X × X → R be a symmetric positive-definite kernel function. We assume that K is bounded, i.e., that there exists κ² > 0 such that sup_{x∈X} K(x, x) ≤ κ². Additionally, we assume that there is a countable orthonormal basis {ψ_j}_{j=1}^∞ ⊆ L²(ρ_X) and a sequence of corresponding real numbers t₁² ≥ t₂² ≥ ··· ≥ 0 such that

K(x, x̃) = Σ_{j=1}^∞ t_j² ψ_j(x) ψ_j(x̃),   (3)

where the convergence is absolute and uniform over x, x̃ ∈ X. Mercer's theorem and various generalizations give conditions under which representations like (3) are known to hold (Carmeli et al., 2006); one of the simplest examples is when X is a compact Hausdorff space, ρ_X is a probability measure on the Borel sets of X, and K is continuous. Observe that, in particular, {t_j²} ∈ ℓ¹(N). Additionally, t_j² and ψ_j are eigenvalues and eigenfunctions, respectively, of the linear transformation T defined by

(Tf)(x̃) = ∫_X K(x̃, x) f(x) dρ_X(x) = Σ_{j=1}^∞ t_j² ⟨f, ψ_j⟩_{L²(ρ_X)} ψ_j(x̃)   (4)

for f ∈ L²(ρ_X) and x̃ ∈ X, where ⟨·, ·⟩_{L²(ρ_X)} denotes the inner product on L²(ρ_X).
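The eigendecomposition in (3)-(4) can be checked numerically on a toy discrete example. The following sketch is purely illustrative (the three-point domain, the Gaussian kernel, and the weights are assumptions, not part of the paper): on a finite X, the operator T acts as f ↦ K diag(p) f, and symmetrizing by diag(√p) turns this into an ordinary symmetric eigenproblem with the same eigenvalues t_j².

```python
import numpy as np

# Hypothetical finite example: X = {0, 1, 2} with marginal weights p,
# and a Gaussian kernel (both illustrative assumptions).
X = np.array([0.0, 1.0, 2.0])
p = np.array([0.5, 0.3, 0.2])                      # rho_X weights
K = np.exp(-0.5 * (X[:, None] - X[None, :])**2)    # bounded psd kernel

# On a finite X, T acts as f -> K @ (p * f); conjugating by diag(sqrt(p))
# gives a symmetric matrix with the same eigenvalues t_j^2.
D = np.sqrt(p)
evals, U = np.linalg.eigh(D[:, None] * K * D[None, :])
order = np.argsort(evals)[::-1]
t2, U = evals[order], U[:, order]

# Eigenfunctions psi_j(x_i) = U[i, j] / sqrt(p_i), orthonormal in L2(rho_X).
psi = U / D[:, None]
gram = (psi * p[:, None]).T @ psi                  # <psi_j, psi_k>_{L2(rho_X)}
assert np.allclose(gram, np.eye(3), atol=1e-10)

# Mercer-type reconstruction (3): K(x, x~) = sum_j t_j^2 psi_j(x) psi_j(x~).
K_rec = (psi * t2[None, :]) @ psi.T
assert np.allclose(K_rec, K, atol=1e-10)
```

The same symmetrization trick underlies practical computations with empirical covariance operators.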
Let H ⊆ L²(ρ_X) be the RKHS corresponding to K (Aronszajn, 1950) and let φ_j = t_j ψ_j, j = 1, 2, .... It follows from basic facts about RKHSs that

H = { f = Σ_{j=1}^∞ b_j φ_j : Σ_{j=1}^∞ b_j² < ∞ },

and the inner product on H satisfies ⟨Σ_j b_j φ_j, Σ_j c_j φ_j⟩_H = Σ_j b_j c_j (the corresponding norm is denoted by ‖·‖_H). Our main assumption on the relationship between y, x, and the kernel K is that

f† ∈ H.   (5)

This is a regularity or smoothness assumption on f†. Many of the results in this paper can be modified so that they apply to settings where f† ∉ H, by replacing f† with an appropriate projection of f† and including an approximation error term in the corresponding bounds. This approach leads to the study of oracle inequalities (Zhang, 2005; Zhang et al., 2013; Koltchinskii, 2006; Steinwart et al., 2009; Hsu et al., 2014), which we do not pursue in detail here. However, investigating oracle inequalities for general regularization operators may be of interest for future research, as most existing work focuses on ridge regularization.
Another interpretation of condition (5) is that it is a minimal regularity condition on f† for ensuring that f† can be estimated consistently using the kernel methods considered below. One key aspect of the upper bounds in Section 5 is that they show f† can be estimated more efficiently if it satisfies stronger regularity conditions (and if the regularization method used is sufficiently adaptable). A convenient way to formulate a collection of regularity conditions with varying strengths, which will be useful in the sequel, is as follows. For ζ ≥ 0, define the Hilbert space

H^ζ = { f = Σ_{j=1}^∞ b_j φ_j ∈ H : Σ_{j=1}^∞ b_j²/t_j^{2ζ} < ∞ },

with the norm given by ‖f‖²_{H^ζ} = Σ_j b_j²/t_j^{2ζ}. Then H^0 = H, and H^ζ ⊆ H^{ζ′} whenever ζ ≥ ζ′ ≥ 0. Additionally, for positive integers J, define the finite-rank subspace

H°_J = span{φ_1, ..., φ_J}.

We have the inclusion H°_J ⊆ H^ζ for every ζ ≥ 0; thus, f† ∈ H°_J is an even stronger condition (provided the t_j² are strictly positive). Conditions such as f† ∈ H^ζ and f† ∈ H°_J are known as source conditions elsewhere in the literature (e.g., Bauer et al., 2007; Caponnetto and Yao, 2010).

Regularization
As discussed in Section 1, our goal is to find estimators f̂ that minimize the risk (1).
In this paper, we focus on regularization-based estimators for f † . In order to precisely describe these estimators, we require some additional notation for various operators that will be of interest, and some basic definitions from regularization theory.

Finite-rank operators of interest
Additionally, define the finite-rank linear operators S_X : H → R^n and T_X : H → H (both depending on X = (x_1, ..., x_n)) by

S_X f = (f(x_1), ..., f(x_n)),   T_X f = (1/n) Σ_{i=1}^n f(x_i) K_{x_i},

where f ∈ H and K_{x_i} = K(x_i, ·) ∈ H. Let ⟨·, ·⟩_{R^n} denote the normalized inner product on R^n, defined by ⟨u, v⟩_{R^n} = (1/n) Σ_{i=1}^n u_i v_i. Then the adjoint of S_X with respect to ⟨·, ·⟩_H and ⟨·, ·⟩_{R^n}, S*_X : R^n → H, is given by

S*_X v = (1/n) Σ_{i=1}^n v_i K_{x_i},

and T_X = S*_X S_X. Moreover, S_X S*_X : R^n → R^n is represented by the matrix K/n, where K = (K(x_i, x_j))_{1≤i,j≤n} is the kernel matrix, which is ubiquitous in kernel methods and enables finite computation.
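On a finite-dimensional feature space, these identities can be verified directly. The sketch below assumes a linear kernel K(x, x̃) = φ(x)·φ(x̃) so that H is identified with R^3 and ⟨·, ·⟩_H with the Euclidean inner product; `Phi`, `S_X`, and `S_X_star` are hypothetical names introduced for this illustration only.

```python
import numpy as np

# Illustrative linear-kernel example: rows of Phi are feature vectors phi(x_i).
rng = np.random.default_rng(1)
n, d = 8, 3
Phi = rng.normal(size=(n, d))
Kmat = Phi @ Phi.T                     # kernel matrix (K(x_i, x_j))

S_X = lambda f: Phi @ f                # S_X f = (f(x_1), ..., f(x_n))
S_X_star = lambda v: Phi.T @ v / n     # S*_X v = (1/n) sum_i v_i K_{x_i}

f, v = rng.normal(size=d), rng.normal(size=n)

# Adjointness w.r.t. the normalized inner product <u, v>_{R^n} = (1/n) sum u_i v_i:
assert np.isclose(np.dot(S_X(f), v) / n, np.dot(f, S_X_star(v)))

# S_X S*_X : R^n -> R^n is represented by the matrix K/n:
assert np.allclose(S_X(S_X_star(v)), Kmat @ v / n)
```

The second assertion is exactly the finite-computation identity used below to express f̂_λ through the kernel matrix.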

Basic definitions
The main idea behind a regularization family {g_λ}_{λ>0} is that g_λ "looks" similar to t ↦ 1/t, but is better-behaved near t = 0, i.e., it is bounded by λ^{−1}. An important quantity that is related to the adaptability of a regularization family is the qualification of the regularization. The qualification of the regularization family {g_λ}_{λ>0} is the maximal ξ > 0 such that

sup_{0<t≤κ²} |1 − t g_λ(t)| t^ν ≤ C_ν λ^ν   for all 0 ≤ ν ≤ ξ and all λ > 0,

where C_ν > 0 may depend on ν but not on λ. If a regularization family has qualification ξ, we say that it "saturates at ξ." Two regularization families that are the major motivation for the results in this paper are ridge (Tikhonov) regularization, where g_λ(t) = 1/(t + λ), and principal component (spectral cut-off) regularization, where

g_λ(t) = (1/t) 1{t ≥ λ}.

Observe that ridge regularization has qualification 1 and principal component regularization has qualification ∞. Another example of a regularization family with qualification ∞ is the Landweber iteration, which can be viewed as a special case of gradient descent (see, e.g., Rosasco et al., 2005; Bauer et al., 2007; Lo Gerfo et al., 2008).

Estimators
Given a regularization family {g_λ}_{λ>0}, we define the g_λ-regularized estimator for f†,

f̂_λ = g_λ(T_X) S*_X y.

Here, g_λ is applied to the spectrum of the self-adjoint finite-rank operator T_X via the functional calculus (see, e.g., Reed and Simon, 1980, Ch. 7); this is a generalization of standard matrix functions (Tropp, 2015, Ch. 2). Since S*_X g_λ(K/n) = g_λ(T_X) S*_X, we have the finitely computable representation f̂_λ = S*_X g_λ(K/n) y = Σ_{i=1}^n γ̂_i K_{x_i}, where (γ̂_1, ..., γ̂_n) = g_λ(K/n) y / n; computing the γ̂_i involves an eigenvalue decomposition of the matrix K. The dependence of f̂_λ on the regularization family is implicit; our results hold for any regularization family except where explicitly stated otherwise. The estimators f̂_λ are the main focus of this paper.
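A minimal sketch of the finitely computable representation: the helper `regularized_fit` (a hypothetical name, not from the paper) applies g_λ to the spectrum of K/n and returns the coefficients γ̂; the Gaussian kernel and sample sizes are illustrative assumptions.

```python
import numpy as np

def regularized_fit(Kmat, y, g, lam):
    """Compute gamma = g_lam(K/n) y / n via the spectral (functional) calculus."""
    n = len(y)
    evals, V = np.linalg.eigh(Kmat / n)                    # spectrum of K/n
    gK = (V * g(np.clip(evals, 0.0, None), lam)) @ V.T     # g_lam(K/n)
    return gK @ y / n                                      # gamma_1, ..., gamma_n

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=40)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=40)
Kmat = np.exp(-(x[:, None] - x[None, :])**2)               # Gaussian kernel (illustrative)

ridge = lambda t, lam: 1.0 / (t + lam)
gamma = regularized_fit(Kmat, y, ridge, lam=1e-2)
f_hat = Kmat @ gamma                                       # in-sample values f_lambda(x_i)

# With ridge, gamma solves (K/n + lam I) gamma = y/n, i.e. (K + n*lam*I) gamma = y:
assert np.allclose((Kmat + 40 * 1e-2 * np.eye(40)) @ gamma, y, atol=1e-8)
```

Swapping `ridge` for a spectral cut-off function yields KPCR with the same eigendecomposition.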

General bound on the risk
Theorem 1. Assume that the source condition f† ∈ H^ζ holds for some ζ ≥ 0, and that g_λ has qualification at least max{(ζ + 1)/2, 1}. Define the effective dimension

d_λ = Σ_{j=1}^∞ t_j²/(t_j² + λ).

Then the risk bound (12) holds. Theorem 1 is proved in the Appendices. The first two terms in the upper bound (12), of order λ^{ζ+1} ‖f†‖²_{H^ζ} and σ² d_λ/n respectively, are typically the dominant terms. In the upper bound (12), the interaction between the kernel K and the distribution ρ_X is reflected in the effective dimension d_λ (see, e.g., Zhang, 2005; Caponnetto and De Vito, 2007).
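The effective dimension is easy to compute numerically. A sketch for polynomially decaying eigenvalues t_j² = j^{−2b} (the decay model and constants are illustrative assumptions), where d_λ grows like λ^{−1/(2b)}:

```python
import numpy as np

def d_lam(t2, lam):
    """Effective dimension d_lam = sum_j t_j^2 / (t_j^2 + lam)."""
    return np.sum(t2 / (t2 + lam))

j = np.arange(1, 10**6 + 1)
b = 1.0
t2 = j**(-2 * b)                         # t_j^2 = j^{-2}

for lam in [1e-2, 1e-3, 1e-4]:
    # d_lam should be within a constant factor of lam^{-1/(2b)}:
    ratio = d_lam(t2, lam) / lam**(-1 / (2 * b))
    assert 0.5 < ratio < 5.0

# d_lam decreases in lam and never exceeds the rank:
assert d_lam(t2, 1e-4) > d_lam(t2, 1e-2)
assert d_lam(t2, 1e-4) < len(t2)
```

Balancing the two dominant terms of (12) with d_λ ≍ λ^{−1/(2b)} recovers the familiar nonparametric rates of Section 5.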
The regularity of f† enters through the norms of f† (both the H- and H^ζ-norms) and the exponent on λ.
Risk bounds on general regularization estimators similar tof λ were previously obtained by Bauer et al. (2007). However, their bounds (e.g., Theorem 10 of Bauer et al., 2007) are independent of the ambient RKHS H, i.e., they do not depend on the eigenvalues {t 2 j }. Our bounds are tighter than those of Bauer et al. because we take advantage of the structure of H. In contrast with our Theorem 1, the results of Bauer et al. do not give minimax bounds (not easily, at least), because minimax rates must depend on the t 2 j .

Implications for kernels characterized by their eigenvalues' rate of decay
We now state consequences of Theorem 1 that give explicit rates for estimating f† via f̂_λ, for any regularization family, under specific assumptions about the decay rate of the eigenvalues {t_j²}. We first consider the case where the eigenvalues have polynomial decay.
Remark 1. Observe that if g_λ has qualification at least max{(ζ + 1)/2, 1} (and the other conditions of Corollary 1 are met), then f̂_λ attains the minimax rate for estimating functions over H^ζ (Pinsker, 1980). Similar results were recently obtained by Blanchard and Mücke (2016) for essentially the same class of regularization operators. Our Theorem 1, from which the corollary follows, applies to a broader class of kernels than the results of Blanchard and Mücke. When the eigenvalues {t_j²} have exponential or Gaussian-type decay, the rates are nearly the same as in finite dimensions.
Corollary 2. Assume that C, α > 0 are constants such that 0 < t_j² ≤ C e^{−αj} for all j = 1, 2, .... Assume that g_λ has qualification at least 1 and that (11) holds for any 0 < δ ≤ 1. Finally, take λ = C′ n^{−1} log(n) for a suitable constant C′ > 0 so that the conditions on λ from Theorem 1 are satisfied. Then

R_ρ(f̂_λ) = O(log(n)/n),

where the constants implicit in the big-O may depend on κ², C, C′, α, δ, and κ²_δ, but nothing else.

Corollary 3. Assume that C, α > 0 are constants such that 0 < t_j² ≤ C e^{−αj²} for all j = 1, 2, .... Assume that g_λ has qualification at least 1 and that (11) holds for any 0 < δ ≤ 1. Finally, take λ = C′ n^{−1} log(n)^{1/2} for a suitable constant C′ > 0 so that the conditions on λ from Theorem 1 are satisfied. Then

R_ρ(f̂_λ) = O(log(n)^{1/2}/n),

where the constants implicit in the big-O may depend on κ², C, C′, α, δ, and κ²_δ, but nothing else.

Remark 3. In Corollaries 2-3, we get minimax estimation over H = H^0 (Pinsker, 1980). However, our bounds are not refined enough to pick up any potential improvements which may be had if a stronger source condition is satisfied (e.g., f† ∈ H^ζ for ζ > 0). This is typical in settings like this, because the minimax rate is already quite fast, i.e., within a log-factor of the parametric rate n^{−1}.
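The near-parametric rates in Corollaries 2-3 reflect the slow growth of the effective dimension under fast eigenvalue decay. A numerical sketch (the decay constants are illustrative): d_λ grows like log(1/λ) for exponential decay and like √log(1/λ) for Gaussian-type decay.

```python
import numpy as np

def d_lam(t2, lam):
    return np.sum(t2 / (t2 + lam))

j = np.arange(1, 2001)
t2_exp   = np.exp(-j.astype(float))       # t_j^2 = e^{-j}
t2_gauss = np.exp(-j.astype(float)**2)    # t_j^2 = e^{-j^2}

for lam in [1e-3, 1e-6, 1e-9]:
    # Exponential decay: d_lam is within a constant of log(1/lam).
    assert 0.3 < d_lam(t2_exp, lam) / np.log(1 / lam) < 3.0
    # Gaussian-type decay: d_lam is within a constant of sqrt(log(1/lam)).
    assert 0.3 < d_lam(t2_gauss, lam) / np.sqrt(np.log(1 / lam)) < 3.0
```

Plugging these growth rates into the variance term σ² d_λ/n of Theorem 1 yields the log-factor rates of the corollaries.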

Parametric rates for finite-dimensional kernels and subspaces
If the kernel has finite rank (i.e., t_j² = 0 for j sufficiently large), then it follows directly from Theorem 1 that f̂_λ attains the parametric rate n^{−1} for appropriately chosen λ. If the kernel has infinite rank, but f† is contained in the finite-dimensional subspace H°_J for some J < ∞, then Theorem 1 can still be applied, provided g_λ has high qualification. Indeed, if g_λ has infinite qualification and f† ∈ H°_J, then it follows that f† ∈ H^ζ for all ζ ≥ 0, and Theorem 1 implies that, for any 0 < α < 1, R_ρ(f̂_λ) = O({‖f†‖²_H + σ²}/n^{1−α}) for appropriately chosen λ. In fact, we can improve on this rate for KPCR. The next proposition implies that the risk of KPCR matches the parametric rate n^{−1} for f† ∈ H°_J; the proof requires a different argument, based on eigenvalue perturbation theory (found in the Appendices).

Simulated data
This simulation study shows how KPCR is able to adapt to highly structured signals, while KRR requires more favorable structure from the ambient RKHS.
To compute f̂_λ, we use the discrete kernel K(x, x̃) = 1{x = x̃}. For each x ∈ X, the estimate of f†(x) is based on an empirical average; regularization via KRR and KPCR has the effect of shrinking the empirical estimate of f†(x) towards zero.
Using an iid sample of size n = 2^13, we compute f̂_λ (either KRR or KPCR) for λ in a discrete grid of 2^10 values uniformly spaced between 10^{−5} and 0.02, and then choose the value of λ for which f̂_λ has the smallest validation mean-squared error. Figure 1(b) has roughly the same shape, just shifted down by σ² = 1/4. The selected values are λ̂_KRR = 0.001534 for KRR and λ̂_KPCR = 0.001573 for KPCR. These choices of λ yield the final estimators, f̂_{KRR,λ̂_KRR} and f̂_{KPCR,λ̂_KPCR}; the ρ_X-MSE is 0.0034 for KRR and 0.0003 for KPCR. In Figure 1(c), we plot the functions f̂_{KRR,λ̂_KRR} and f̂_{KPCR,λ̂_KPCR} to show the effect of regularization. The KRR function is non-zero for much of the domain, while the KPCR function is zero for nearly all of the domain (like f†).
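Under the discrete kernel, the shrinkage has a simple closed form: with p̂_x the empirical frequency of x, one can check that KRR scales the empirical average ȳ_x by p̂_x/(p̂_x + λ), while KPCR keeps ȳ_x unchanged when p̂_x ≥ λ and kills it otherwise. A sketch with illustrative sizes and signal (not the paper's exact n, grid, or f†):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4096, 50
x = rng.integers(0, m, size=n)                 # X = {0, ..., m-1} (illustrative)
f_true = np.zeros(m); f_true[:5] = 1.0         # sparse, highly structured signal
y = f_true[x] + 0.5 * rng.normal(size=n)

counts = np.bincount(x, minlength=m)
p_hat = counts / n                             # eigenvalues of K/n for K = 1{x = x~}
ybar = np.bincount(x, weights=y, minlength=m) / np.maximum(counts, 1)

lam = 1e-3
f_krr  = ybar * p_hat / (p_hat + lam)          # Tikhonov: multiplicative shrinkage
f_kpcr = ybar * (p_hat >= lam)                 # spectral cut-off: keep or kill

# KPCR leaves well-observed values unshrunk; KRR always shrinks toward zero:
heavy = p_hat >= lam
assert np.allclose(f_kpcr[heavy], ybar[heavy])
assert np.all(np.abs(f_krr[heavy]) <= np.abs(ybar[heavy]) + 1e-12)
```

This keep-or-kill behavior is exactly why KPCR can reproduce a sparse f† while KRR leaves a small non-zero bias everywhere.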

We repeat the above simulation for different marginal distributions ρ_X(x) ∝ x^{−α}, for 1/2 ≤ α ≤ 2, which imply different eigenvalue sequences {t_j²}. The mean and standard deviation of the ρ_X-MSEs over 10 repetitions are shown in Figure 1(d). This confirms KPCR's ability to adapt to the regularity of f†, regardless of the ambient RKHS; KRR requires more structure to achieve similar results.

Real data
We also compared KRR and KPCR using three "weighted degree" kernels designed for recognizing splice sites in genetic sequences (Sonnenburg, 2008). The 3300 samples are divided into a training set (1000), validation set (1100), and testing set (1200). For each kernel, we use the training data to compute f̂_λ for λ in a discrete grid of 2^10 equally-spaced values between 10^{−5} and 0.4, and select the value λ̂ for which the MSE of f̂_λ on the validation set is smallest. The MSE on the testing set and the effective dimension d_λ̂ for the selected λ̂ (on the training data) are as follows:

Discussion
Our unified analysis for a general class of regularization families in nonparametric regression highlights two important statistical properties. First, the results show minimax optimality for this general class in several commonly studied settings, which was only previously established for specific regularization methods. Second, the results demonstrate the adaptivity of certain regularization families to subspaces of the RKHS, showing that these techniques may take advantage of additional smoothness properties that the signal may possess. It is notable that the most well-studied family, KRR/Tikhonov regularization, does not possess this adaptability property.

Appendix A: Proof of Theorem 1
To provide some intuition behind f̂_λ and our proof strategy, consider the operator T defined in (4), as applied to an element f ∈ H:

Tf = Σ_{j=1}^∞ t_j² ⟨f, ψ_j⟩_{L²(ρ_X)} ψ_j.

Observe that T is a "population" version of the operator T_X, defined in (8).
Unlike T X , T often has infinite rank; however, we still might expect that T ≈ T X for large n, where the approximation holds in some suitable sense.
We also have a large-n approximation for S*_X y. For f ∈ H,

E⟨S*_X y, f⟩_H = E[y f(x)] = ⟨f†, f⟩_{L²(ρ_X)} = ⟨T f†, f⟩_H,

where we have used the identity φ_j = t_j ψ_j to obtain the last equality. It follows that S*_X y ≈ T f†. Hence, to recover f† from y, it would be natural to apply the inverse of T to S*_X y. However, T is not invertible whenever it has infinite rank, and regularization becomes necessary. We thus arrive at the chain of approximations which help motivate f̂_λ:

f̂_λ = g_λ(T_X) S*_X y ≈ g_λ(T) T f† ≈ f†,

where g_λ(T) may be viewed as an approximate inverse of T for a suitably chosen regularization parameter λ.

A.1. Bias-variance decomposition
The proof of Theorem 1 is based on a simple bias-variance decomposition of the risk of f̂_λ. Let ε = (ε_1, ..., ε_n), where ε_i = y_i − f†(x_i). Then (Proposition 2)

R_ρ(f̂_λ) = B_ρ(f̂_λ) + V_ρ(f̂_λ),

where B_ρ(f̂_λ) and V_ρ(f̂_λ) are defined as

B_ρ(f̂_λ) = E‖E[f̂_λ | x_1, ..., x_n] − f†‖²_{ρ_X},   V_ρ(f̂_λ) = E‖f̂_λ − E[f̂_λ | x_1, ..., x_n]‖²_{ρ_X}.

Our proof separately bounds the bias B_ρ(f̂_λ) and variance V_ρ(f̂_λ) terms from Proposition 2. Taken together, these bounds imply a bound on R_ρ(f̂_λ).

A.2. Translation to vector and matrix notation
We first note that the Hilbert space H is isometric to ℓ²(N) via the isometric isomorphism ι : H → ℓ²(N), given by

ι(f) = (⟨f, φ_1⟩_H, ⟨f, φ_2⟩_H, ...)

(we take all elements of ℓ²(N) to be infinite-dimensional column vectors). Using this equivalence, we can convert elements of H and linear operators on H appearing in Proposition 2 into (infinite-dimensional) vectors and matrices, respectively, which we find simpler to analyze in the sequel. Define the (infinite-dimensional) diagonal matrix

D = diag(t₁², t₂², ...).

Observe that D represents the operator T in the basis {φ_j}. Also, for 1 ≤ i ≤ n, let φ_i = (φ_1(x_i), φ_2(x_i), ...)ᵀ = ι(K_{x_i}), and let φ_i φ_iᵀ be the ∞ × ∞ matrix whose (j, j′)-th entry is φ_j(x_i) φ_{j′}(x_i). Finally, let I = diag(1, 1, ...), and let ‖·‖ = ‖·‖_{ℓ²(N)} denote the norm on ℓ²(N). In what follows, we work with the bias-variance decomposition from Proposition 2 in this matrix and vector notation. The boundedness of the kernel implies

‖D‖ ≤ κ²   and   ‖φ_i φ_iᵀ‖ = Σ_j φ_j(x_i)² = K(x_i, x_i) ≤ κ²,

where the norms are the operator norms on ℓ²(N).

A.3. Probabilistic bounds
For 0 < r < 1, define the event

A_r = { ‖(T + λI)^{−1/2} (T_X − T) (T + λI)^{−1/2}‖ > r },

and let A_r^c denote its complement. Our bounds on the bias B_ρ(f̂_λ) and variance V_ρ(f̂_λ) are based on analysis on the event A_r^c (for a constant r). Therefore, we also need to show that A_r^c has large probability (equivalently, that A_r has small probability).
This final quantity is bounded above by 2d_λ/(1 − r) on the event A_r^c.
Lemma 6. In the same setting as Lemma 5, the analogous bound holds in expectation. Proof. The proof is based on integrating the tail bound from Lemma 5:

C.2. Differences of powers of bounded operators
Lemma 7. Let A and B be non-negative self-adjoint operators with ‖A‖ < 1 and ‖B‖ < 1. For any γ ≥ 1,

‖A^γ − B^γ‖ ≤ c_γ ‖A − B‖,

where c_γ > 0 is a constant depending only on γ. Proof. The proof considers three possible cases for the value of γ: (i) γ is an integer, (ii) 1 < γ < 2, and (iii) γ is a non-integer larger than two. Case 1: γ is an integer. In this case, we have the telescoping identity

A^γ − B^γ = Σ_{k=0}^{γ−1} A^k (A − B) B^{γ−1−k},

so that ‖A^γ − B^γ‖ ≤ γ ‖A − B‖, since ‖A‖, ‖B‖ < 1.