How isotropic kernels perform on simple invariants

We investigate how the training curve of isotropic kernel methods depends on the symmetry of the task to be learned, in several settings. (i) We consider a regression task, where the target function is a Gaussian random field that depends only on d_∥ variables, fewer than the input dimension d. We compute the expected test error ε, which follows ε ∼ p^{−β}, where p is the size of the training set. We find that β ∼ 1/d independently of d_∥, supporting previous findings that the presence of invariants does not resolve the curse of dimensionality for kernel regression. (ii) Next we consider support-vector binary classification and introduce the stripe model, where the data label depends on a single coordinate, y(x) = y(x_1), corresponding to parallel decision boundaries separating labels of different signs, and consider that there is no margin at these interfaces. We argue and confirm numerically that, for large bandwidth, β = (d − 1 + ξ)/(3d − 3 + ξ), where ξ ∈ (0, 2) is the exponent characterizing the singularity of the kernel at the origin. This estimate improves classical bounds obtainable from Rademacher complexity. In this setting there is no curse of dimensionality, since β → 1/3 as d → ∞. (iii) We confirm these findings for the spherical model, for which y(x) = y(||x||). (iv) In the stripe model, we show that, if the data are compressed along their invariants by some factor λ (an operation believed to take place in deep networks), the test error is reduced by a factor λ^{−2(d−1)/(3d−3+ξ)}.


Introduction and related works
Deep neural networks are successful at a variety of tasks, yet understanding why they work remains a challenge. In particular, we do not know a priori how many data are required to learn a given rule, not even the order of magnitude. Specifically, let us denote by p the number of examples in the training set. After learning, performance is quantified by the test error ε(p). Quite remarkably, one empirically observes that ε(p) is often well fitted by a power-law decay ε ∼ p^{−β}. The exponent β is found to depend on the task, on the dataset and on the learning algorithm [1, 2]. General arguments would suggest that β should be extremely small, and learning thus essentially impossible, when the dimension D of the data is large, which is generally the case in practice (e.g. in images, where D is the number of pixels multiplied by the number of color channels). For example, in a regression task, if the only assumption on the target function is that it is Lipschitz continuous, then the test error cannot be guaranteed to decay faster than with an exponent β ∼ 1/D [3]. This curse of dimensionality [4] stems from the geometrical fact that the distance δ among nearest-neighbor data points decays extremely slowly in large D, as δ ∼ p^{−1/D}, so any interpolation method is very imprecise. The mere observation that deep learning works in large dimension implies that data are very structured [5]. Yet how to describe this structure mathematically and how to build a quantitative theory for β remain a challenge. Our present goal is to study the relationship between β and symmetries in the data in simple models.
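To make the geometric origin of this curse concrete, the following minimal sketch (ours, not from the paper) measures how the nearest-neighbor distance δ shrinks with the training set size p for uniform data in [0, 1]^D; the fitted slope of log δ versus log p should approach −1/D.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

def median_nn_distance(p, D):
    """Median nearest-neighbor distance among p uniform points in [0, 1]^D."""
    x = rng.random((p, D))
    dist, _ = cKDTree(x).query(x, k=2)  # k=2: the first neighbor is the point itself
    return np.median(dist[:, 1])

for D in (2, 5, 10):
    ps = [500, 1000, 2000, 4000, 8000]
    deltas = [median_nn_distance(p, D) for p in ps]
    slope = np.polyfit(np.log(ps), np.log(deltas), 1)[0]
    print(f"D = {D:2d}: fitted slope {slope:+.3f}, theory -1/D = {-1/D:+.3f}")
```

Already for D = 10, doubling p barely reduces δ, which is the geometric statement of the curse.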
Recently there has been considerable interest in studying the infinite-width limit of neural networks, motivated by the observation that performance generally improves with the number of parameters [6][7][8][9][10][11]. That limit depends on how the weights at initialization scale with the width. For a specific choice, similar to the LeCun initialization often used in practice, deep learning becomes equivalent to a kernel method [12], where the kernel has been coined the neural tangent kernel. In kernel methods, the learned function Z(x) is a linear combination of the functions K(x, x^µ), where the x^µ are the training data and K is the kernel. For classification on the stripe model, where the label depends on a single coordinate, two regimes emerge depending on the kernel bandwidth σ. For vanishing σ, the support-vector machine is tantamount to a nearest-neighbor algorithm, which inevitably suffers from the curse of dimensionality with an exponent β ∼ 1/d. However, in the limit of large σ, we provide scaling (heuristic) arguments, which we systematically confirm numerically, showing that β = (d − 1 + ξ)/(3d − 3 + ξ), where ξ is an exponent characterizing the singularity of the kernel at the origin (e.g. ξ = 1 for a Laplace kernel). This exponent β stays finite even in large dimension.
In section 4, we show that these results are not restricted to strictly flat interfaces: the same exponent β is found for the spherical model, in which y(x) = y(||x||). More generally, our analysis suggests that this result will break down if the boundary separating labels shows significant variation below a length scale r_c ∼ p^{−1/(d−1)}. Avoiding the curse of dimensionality thus requires the boundary separating labels to become increasingly regular as d increases.
Finally, in section 5, we come back to the stripe model and study how compressing the input data along their invariants (namely all the directions other than x_1) by a factor λ improves performance, an effect believed to play a key role in the success of deep learning [5]. We argue and confirm empirically that, when mild, such a compression leaves the exponent β unchanged but reduces the test error by a factor λ^{−2(d−1)/(3d−3+ξ)}.

Related works

Regression
The optimal worst-case performance of kernel regression has been investigated using a source condition that constrains the decay of the coefficients of the true function in the eigenbasis of the kernel [27][28][29]. For isotropic kernels and a uniform data distribution, this condition is similar to controlling the decay of the Fourier components of the true function as we do here, and with our notation (see footnote 2) the optimal worst-case generalization error is ε_wc ≲ p^{−β_wc} with β_wc = (α_T(d) − d)/α_T(d), which is independent of the student. Yet in our approach we average the mean-squared error over all Gaussian fields with a given covariance, leading to a typical (instead of worst-case) exponent β = (1/d) min(α_T(d) − d, 2α_S(d)). As expected, we always have β > β_wc: this follows from the fact that the exponents α_T, α_S must be larger than d for the kernels to be finite at the origin, a condition needed for our results to apply.

Classification
There is a long history of works computing the learning curve exponent β in regression or classification tasks where the true function or label depends on a single direction in input space, starting from the perceptron model [30] and including support vector classification [31]. More recently, random feature models have received a lot of attention and can be resolved analytically in some cases using random matrix or replica theories [9,[32][33][34]]. Yet these results for classification generally consider linearly separable data (see footnote 3), and most importantly, for both regression and classification tasks they apply in the limit d → ∞ and p → ∞ with α = p/d fixed. In [31], for a single interface separating labels and kernels similar to ours, the learning curve of the support vector classifier was shown to decrease as ε ∼ 1/α, as also found for the perceptron [35]. Here we consider both linearly and non-linearly separable data, and take the limit of large training set size p at fixed dimension d. This is in our view warranted considering data sets commonly used as benchmarks, such as MNIST or CIFAR, for which the intrinsic dimension d ∈ [15, 35] and p ≈ 6 × 10⁴. In simple models, for such numbers we do find that the training curves are well described by the limit we study. Specifically, the exponent β we find depends on the dimension d and does not converge to 1 as d → ∞, indicating that the two limits do not commute.
Classical works on kernel classification based on Rademacher complexity guarantee β ⩾ 1/4 [36, 37] for certain algorithms applied to the stripe and spherical models (see footnote 4). Our estimate thus improves on that bound, even in the limit of large dimension, where we find β = 1/3.

Kernel regression: teacher-student framework
We consider kernel ridgeless regression on Gaussian random data that present invariants. Our framework corresponds to a teacher-student setting for supervised learning [35,[38][39][40][41][42]], where two variants of the same model (here kernels) are used both to generate the data and to learn them. The target function Z_T(x) is assumed to be a random Gaussian process N(0, K_T) with zero mean and covariance determined by a strictly positive-definite isotropic translation-invariant teacher kernel,

E_T[Z_T(x) Z_T(x′)] = K_T(x − x′),

where we denote by E_T the expectation over the teacher Gaussian random process (see footnote 5). Strict positive-definiteness is required to generate such a random function.

Footnote 2: Specifically, this literature introduces an exponent b characterizing the decay of the eigenvalues λ_ρ of the kernel with their rank ρ: λ_ρ ∼ ρ^{−b}. In our setup it is straightforward to show that b = α_S/d. Another exponent c (sometimes noted 2r [29]) characterizes the smoothness of the target function f⋆. It is defined as the largest exponent for which ⟨f⋆| K_S^{1−c} |f⋆⟩ < ∞. It is straightforward to show that in our case c = (α_T − d)/α_S. The worst-case exponent is β_wc = bc/(bc + 1) [27][28][29] and is expressed in our notation in the main text.

Footnote 3: See [31] for an example of non-linearly separable data lying on a hypercube.

Footnote 4: For example, for a single interface, theorem 21 of [36] bounding the test error can be applied with the linear function f(x) = x_1, which has a finite RKHS norm. The bound on the test error then behaves as p^{−1/4}. An algorithm minimizing the expression for the bound on all functions on the RKHS ball of identical norm must thus lead to β ⩾ 1/4.
We further assume that the function Z_T(x) does not depend on all the variables x = (x_1, …, x_d)ᵀ, but only on the first d_∥ components x_∥ = (x_1, …, x_{d_∥}): it is generated by a teacher kernel that has the same property, namely

K_T(x − x′) = K_T(x_∥ − x′_∥).

The (finite) training set is made up of the values of the target function Z_T(x^µ) at p points {x^µ}_{µ=1}^p. Kernel (ridgeless) regression is performed with a student kernel K_S(x, x′), which we also take to be isotropic and translation invariant and which can differ from the teacher kernel K_T(x, x′). The student has no prior knowledge of the presence of invariants: its kernel is a function of all the spatial components.
Kernel regression consists in writing the prediction Ẑ_S(x) at a generic point x as a linear combination of student kernel overlaps on the whole training set, namely

Ẑ_S(x) = Σ_{µ=1}^p a_µ K_S(x, x^µ).  (1)

The vector of coefficients a is determined by minimizing the mean-squared loss on the training set:

a = argmin_a Σ_{µ=1}^p [Ẑ_S(x^µ) − Z_T(x^µ)]².  (2)

The minimization of such a quadratic loss can be carried out explicitly, and the student prediction can be written as

Ẑ_S(x) = Σ_{µ,ν=1}^p K_S(x, x^µ) (K_S^{−1})_{µν} Z_T(x^ν),  (3)

where the vector Z_T ≡ (Z_T(x^µ))_{µ=1}^p contains all the samples in the training set and K_S^{µν} ≡ K_S(x^µ, x^ν) is the Gram matrix. By definition, the Gram matrix is always invertible for any training set if the kernel K_S is strictly positive definite. The generalization error is then evaluated as the expected mean-squared error on out-of-sample data that were not used for training,

ε_T = E_x [Ẑ_S(x) − Z_T(x)]²;  (4)

numerically, it is estimated by averaging over a test set composed of p_test newly sampled data points:

ε_T ≈ (1/p_test) Σ_{i=1}^{p_test} [Ẑ_S(x_i) − Z_T(x_i)]².  (5)

Footnote 5: With respect to the kernel literature, note that in our setting Z_T never belongs to the RKHS of K_T, see e.g. [44]. The conditions for it to belong to K_S are discussed in [2].
This quantity is a random variable, and we take the expectation also with respect to the teacher process to define an average test error ε = E_T ε_T; in the numerical simulations that we discuss later, we simply average over several runs of the teacher Gaussian process. We study how the expected test error ε decays with the size p of the training set. Asymptotically for large p, this decay follows a power law ε ∼ p^{−β}. In [2], β was derived in the absence of invariants (d_∥ = d), building on results from the kriging literature [26]. It was found that β depends on three quantities: the dimension d and two exponents α_T(d), α_S(d) related to the two kernels. These exponents describe how the Fourier transforms of the kernels decay at large frequencies: K̃_T(w) ∼ ||w||^{−α_T(d)}, and similarly for the student K_S. Notice that, since the kernels are translation invariant, their Fourier transforms are functions of a single frequency vector w. Moreover, the exponents α_T(d), α_S(d) depend on the dimension of the space where the Fourier transform is computed.
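The teacher-student protocol just described is straightforward to simulate. The sketch below is our illustration, not the paper's released code: the Laplace kernel for both teacher and student, σ = 4, d = 3 and the 10^{−10} jitter added for numerical stability are all assumptions. It samples a Gaussian field jointly on train and test points, performs ridgeless regression via equation (3), and fits β from the decay of the test error.

```python
import numpy as np

rng = np.random.default_rng(1)

def laplace_gram(X, Y, sigma):
    """Gram matrix K(x, y) = exp(-||x - y|| / sigma) of the Laplace kernel."""
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-dist / sigma)

def test_error(p, d, p_test=500, sigma=4.0):
    # Sample the teacher field Z_T ~ N(0, K_T) jointly on train and test points.
    X = rng.standard_normal((p + p_test, d))
    K_T = laplace_gram(X, X, sigma)
    Z = np.linalg.cholesky(K_T + 1e-10 * np.eye(len(X))) @ rng.standard_normal(len(X))
    Xtr, Xte, Ztr, Zte = X[:p], X[p:], Z[:p], Z[p:]
    # Ridgeless student prediction: Z_hat(x) = k_S(x)^T K_S^{-1} Z_T (equation (3)).
    a = np.linalg.solve(laplace_gram(Xtr, Xtr, sigma) + 1e-10 * np.eye(p), Ztr)
    Z_hat = laplace_gram(Xte, Xtr, sigma) @ a
    return np.mean((Z_hat - Zte) ** 2)

ps = np.array([100, 200, 400, 800])
errs = np.array([np.mean([test_error(p, d=3) for _ in range(5)]) for p in ps])
beta = -np.polyfit(np.log(ps), np.log(errs), 1)[0]
print("estimated beta:", beta)  # theory: min(theta_T, 2d + 2 theta_S)/d = 1/3 here
```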
Our main theorem, formally presented with its proof in appendix A, is as follows:

Theorem 1 (Informal). Let ε be the average mean-squared error of the regression made with a student kernel K_S on data generated by a teacher kernel K_T, sampled at points taken on a regular d-dimensional square lattice in R^d with fixed spacing δ. Assume that the teacher kernel only varies in a lower-dimensional space: K_T(x) = K_T(x_∥), with x_∥ = (x_1, …, x_{d_∥}). Then, as δ → 0,

ε ∼ δ^{min(α_T(d_∥) − d_∥, 2α_S(d))},

that is, ε ∼ p^{−β} with β = (1/d) min(α_T(d_∥) − d_∥, 2α_S(d)).

Note 1: We expect that under broad conditions the quantity α_T(d_∥) − d_∥ ≡ θ_T (as well as θ_S, obviously) does not depend on d_∥, and that θ_T corresponds to the exponent characterizing the singular behavior of K_T(x) at the origin,

K_T(x) = K_T(0) − c ||x||^{θ_T} + o(||x||^{θ_T}),

as discussed in appendix A. This fact can be shown (see below) for Laplace (where θ_T = 1) and Matérn kernels, whose Fourier transforms can be computed exactly. Thus we recover the curse of dimensionality, since β = (1/d) min(θ_T, 2d + 2θ_S) ⩽ θ_T/d, which is independent of d_∥ and thus of the presence of invariants.

Note 2: A remark is in order for the case of a Gaussian kernel K(z) = exp(−z²), since it is a smooth function and its Fourier transform (being a Gaussian function too) decays faster than any power law at high frequencies. As discussed and verified in the aforementioned paper, this theorem applies also to Gaussian kernels, provided that the corresponding exponent is taken to be θ = ∞. In particular, if the teacher is Gaussian and the student is not, β = 2 + 2θ_S/d; in the opposite scenario, where the teacher is not Gaussian but the student is, β = θ_T/d; if both kernels are Gaussian, β = ∞ and the test error decays with the training set size faster than a power law.

Interpretation:
The following interpretation can be given for theorem 1 when α_S is large, leading to β = θ_T/d. An isotropic kernel corresponds to a Gaussian prior on the Fourier coefficients of the true function being learned, a prior whose magnitude decreases with the wave vector as characterized by the exponent α_S. Clearly, the number of coefficients that can be correctly reconstructed cannot be larger than the number of observations p. For large α_S, we find that kernel regression does indeed reconstruct well a number of order p of the first Fourier coefficients, which correspond to wave vectors w of norm ||w|| ≲ 1/δ ∼ p^{1/d}. Fourier coefficients of larger wave vectors cannot be reconstructed, however, and the mean-squared error is then simply of the order of the sum of the squares of these coefficients:

ε ∼ Σ_{||w|| > p^{1/d}} E_T |Z̃_T(w)|² ∼ ∫_{||w|| > p^{1/d}} dw K̃_T(w) ∼ p^{−(α_T(d)−d)/d} = p^{−θ_T/d}.  (8)

Numerical test:
We now test numerically that kernel regression is blind to the lower-dimensional nature of the task. We consider a d = (D − 1)-dimensional sphere of unit radius, S^d, embedded in R^D. To test robustness with respect to our technical assumption of data points lying on an infinite lattice, we consider instead p i.i.d. points sampled uniformly at random. Each component x_i^µ is generated as a standard Gaussian N(0, 1), and the vector x^µ is then normalized by dividing it by its norm. Points belonging to such a training set have a typical nearest-neighbor distance δ ∼ p^{−1/d}, and we will show that the test error decays with the predicted scaling ε ∼ δ^{βd} = p^{−β}. For the numerical verification we take the student to be a Laplace kernel,

K_S(x, x′) = exp(−||x − x′||/σ),

which is characterized by α_S(d) = d + θ_S with θ_S = 1. As teacher we use Matérn kernels, a family of kernels parametrized by one parameter ν:

K_T(x, x′) = (2^{1−ν}/Γ(ν)) (z/σ)^ν K_ν(z/σ), z = ||x − x′||,

where K_ν(z) is the modified Bessel function of the second kind with parameter ν, and Γ is the Gamma function. Varying ν, one can change the smoothness of the instances of the Gaussian random process; in particular, α_T(d) = d + θ_T with θ_T = 2ν. The spatial dimension is D = 4, and we vary the number of invariants in the task by taking d_∥ = 1, 2, 3. In order to fix d_∥ we simply use z = ||x_∥ − x′_∥|| instead of z = ||x − x′|| when computing the teacher kernel. The scale of the kernels is fixed by the constant σ, which we have taken equal to 4 for both the teacher and the student. Notice that in theorem 1 the value of σ does not play any role, since it does not enter the asymptotic behavior of the test error (at leading order). In figure 2 we show that the numerical simulations match our predictions. Indeed, in this specific case the predicted exponent is

β = (1/d) min(θ_T, 2d + 2θ_S) = min(2ν, 2d + 2)/d.

Notice that the exponent that characterizes the learning curves is indeed independent of d_∥. Its prefactor may however depend on d_∥ in general.
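A possible implementation of this teacher is sketched below. It is our illustration: the Matérn normalization (with K(0) = 1 and length scale σ, no extra √(2ν) factor) is an assumption about the paper's exact convention, and ν = 0.75, d_∥ = 2 are example values. The invariance is imposed, as in the text, by evaluating the kernel on z = ||x_∥ − x′_∥|| only.

```python
import numpy as np
from scipy.special import gamma, kv

def matern(z, nu, sigma=4.0):
    """Matern kernel of parameter nu, normalized so that K(0) = 1."""
    u = np.asarray(z, dtype=float) / sigma
    out = np.ones_like(u)
    nz = u > 0
    out[nz] = 2.0 ** (1 - nu) / gamma(nu) * u[nz] ** nu * kv(nu, u[nz])
    return out

def teacher_gram(X, nu, d_par):
    """The teacher sees only the first d_par components: z = ||x_par - x'_par||."""
    Xp = X[:, :d_par]
    z = np.linalg.norm(Xp[:, None, :] - Xp[None, :, :], axis=-1)
    return matern(z, nu)

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 4))               # D = 4
X /= np.linalg.norm(X, axis=1, keepdims=True)   # i.i.d. points on the unit sphere S^3
K_T = teacher_gram(X, nu=0.75, d_par=2)         # theta_T = 2 nu = 1.5
Z_T = np.linalg.cholesky(K_T + 1e-10 * np.eye(len(X))) @ rng.standard_normal(len(X))
```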

Stripe model
We consider a binary classification task where the labels depend only on one direction in data space, namely y(x) = y(x_1). Layers of y = +1 and y = −1 regions alternate along the direction x_1, separated by parallel planes. Two examples of this setting are sketched in figure 3, corresponding to a single and a double interface. The points x that constitute the training and test sets are i.i.d. with distribution ρ(x). To lighten the notation, we assume that ρ(x) is uniform on a square box Ω of linear extension γ. Yet we expect our arguments to apply more generally if ρ(x) is continuous and does not vanish at the location of the interfaces (no margin). To confirm this view, we will test below our predictions when ρ(x) is Gaussian, with each component x_i ∼ N(0, γ²).

Definition of margin SVC
In this section we consider margin support-vector classification (margin SVC). This algorithm maximizes the margin between a decision boundary and the points in the training set that are closest to it. The prediction of the label ŷ(x) of a new point x is then made according to the sign of the estimated decision function [43]:

ŷ(x) = sign f(x), with f(x) = Σ_{µ=1}^p y^µ α_µ K(||x − x^µ||/σ) + b,  (12)

where the kernel K is conditionally strictly positive definite [45], a condition defined in appendix C, less stringent than strictly positive definite. In equation (12) we write explicitly the kernel bandwidth σ, since it will soon play an important role. The formulation of the margin-SVC algorithm presented here is what is referred to as the dual formulation, but it can be equivalently recast as an attempt to maximize a (signed) distance between training points and the decision boundary [43]. In this dual formulation, the variables α_µ are fixed by maximizing

W(α) = Σ_{µ=1}^p α_µ − (1/2) Σ_{µ,ν=1}^p α_µ α_ν y^µ y^ν K(||x^µ − x^ν||/σ),  (13)

subject to the constraints

α_µ ⩾ 0 for all µ.  (14)

The bias b is set to satisfy

α_µ [y^µ f(x^µ) − 1] = 0 for all µ.  (15)

Equation (15) states that a dual variable α_µ is strictly positive if and only if its associated vector x^µ lies on the margin, that is y^µ f(x^µ) = 1; otherwise it is zero. Vectors with α_µ > 0 are called support vectors (SVs) and are the only ones that enter the expansion of the decision function, equation (12). Two further properties of the solution are used repeatedly below: the charge conservation

Σ_{µ=1}^p y^µ α_µ = 0,  (16)

which is enforced as a constraint of the dual problem, and the canonical condition

y^µ f(x^µ) ⩾ 1 for all µ, with equality on the SVs.  (17)
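A minimal way to run this algorithm in practice is scikit-learn's svm.SVC with a precomputed Gram matrix and a very large box constraint C to emulate the hard-margin limit; the sketch below is our illustration (the paper's own experiments, described in the numerical section, use C = 10^{20} and larger training sets; the sizes and seed here are arbitrary choices).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

def laplace_gram(A, B, sigma):
    """Gram matrix of the Laplace kernel exp(-||a - b|| / sigma)."""
    return np.exp(-np.linalg.norm(A[:, None] - B[None, :], axis=-1) / sigma)

d, p, sigma = 3, 1000, 100.0             # large bandwidth: sigma >> gamma = 1
X = rng.standard_normal((p, d))          # stripe model, single interface at x_1 = 0
y = np.sign(X[:, 0])

svc = SVC(C=1e10, kernel="precomputed")  # huge C emulates the hard-margin limit
svc.fit(laplace_gram(X, X, sigma), y)

X_test = rng.standard_normal((2000, d))
y_hat = svc.predict(laplace_gram(X_test, X, sigma))
print("test error:", np.mean(y_hat != np.sign(X_test[:, 0])))
print("number of SVs:", len(svc.support_))
```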

Some limiting cases of SVC
Vanishing bandwidth: If the kernel function K(z) decreases exponentially fast with some power of z, then in the limit σ ≪ δ, where δ is the average nearest-neighbor distance in the training set, the support-vector machine becomes akin to a nearest-neighbor algorithm. A detailed analysis of this regime for the stripe model is presented in appendix B; here we provide a qualitative argument, assuming that the bias b is negligible. If so, as σ → 0 the decision function at any point is dominated by the contribution of the closest SV, so that the predicted label is simply that of the closest SV. The classification error is susceptible to the curse of dimensionality for such an algorithm, and one expects generically ε ∼ δ ∼ p^{−1/d}, as tested numerically in figure B1 for the stripe model.

Diverging bandwidth:
In this work we focus on the other extreme case, where the bandwidth is larger than the system size, namely σ ≫ γ. In this regime the kernel is always evaluated close to the origin, and we approximate it by its truncated expansion there:

K(z/σ) = K(0) − c (z/σ)^ξ + o((z/σ)^ξ), c > 0.  (18)

The exponent ξ is related to the exponent θ introduced in section 2 by ξ = min(θ, 2), and varies from kernel to kernel. For instance, we have ξ = 1 for Laplace kernels, ξ = 2 for Gaussian kernels, ξ = γ̃ for γ̃-exponential kernels (see footnote 6) and ξ = min(2ν, 2) for Matérn kernels. In appendix C we show that for 0 < ξ < 2 the right-hand side is conditionally strictly positive definite (CSPD), which is the necessary condition for the SVC algorithm to converge. In what follows, we consider 0 < ξ < 2, which excludes the Gaussian case. A proof that in that case the margin-SVC algorithm with the truncated kernel in equation (18) leads to the same solution as with the full kernel in the limit σ ≫ γ is presented in appendix D. Also, due to the charge conservation in equation (16), the constant term K(0) in equation (18) may safely be ignored. The decision function equation (12) associated with the considered radial power kernel hence becomes

f(x) = −Σ_{µ=1}^p y^µ α_µ ||x − x^µ||^ξ + b,  (19)

where the positive constant in equation (18) has been removed by rescaling the bias and the α_µ.
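The exponent ξ can be read off numerically from the small-z behavior of any kernel by fitting K(0) − K(z) ∼ c z^ξ over a range of small distances. A short self-contained check (ours):

```python
import numpy as np

def xi_of(K, z=np.logspace(-4, -2, 30)):
    """Fit K(0) - K(z) ~ c z^xi at small z and return the exponent xi."""
    return np.polyfit(np.log(z), np.log(K(0.0) - K(z)), 1)[0]

print("Laplace            :", xi_of(lambda z: np.exp(-z)))        # -> 1
print("gamma-exp, g = 1.5 :", xi_of(lambda z: np.exp(-z ** 1.5))) # -> 1.5
print("Gaussian           :", xi_of(lambda z: np.exp(-z ** 2)))   # -> 2
```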

Single interface
We consider a single interface at location x_1 = 0, with negative labels for x_1 < 0 and positive ones for x_1 > 0. Already in this case, computing the test error analytically remains a challenge, and we resort to a scaling (asymptotic) analysis to compute β. As p increases, SVs will be present in a narrower and narrower band around the interface. We denote by ∆ the characteristic extension of this band. ∆ will in general depend on the position x_⊥ along the interface. Here we will not study this dependence, as we are interested in its asymptotic behavior with p, γ and σ, and we only track how quantities depend on these variables. From the canonical condition equation (17) on SVs, we have that the function f varies by order one from one side of the band to the other:

f(x_⊥ + ∆ e_1) − f(x_⊥ − ∆ e_1) ∼ 1,  (20)

where e_1 is the unit vector orthogonal to the interface and x_⊥ is any vector parallel to the plane.

Footnote 6: We use γ̃ to distinguish it from the variance of the data points.
Another useful quantity is the distance r_c between nearest SVs. It can be estimated by counting the number of points lying within a cylinder of height ∆ (along x_1) and radius r_c centered on an SV, whose volume follows ∼ ∆ r_c^{d−1}. Using that the density of data points is ∼ p/γ^d, and imposing that the cylinder contains only one additional SV, yields our first scaling relation:

(p/γ^d) ∆ r_c^{d−1} ∼ 1.  (21)

Finally, the last scaling relation results from the function fluctuations being of order one within the band of SVs when moving parallel to the true decision boundary. Indeed, we shall show below that the function gradient along e_1 is constant at leading order in ∆, and of order 1/∆ following equation (20). Then the facts that (i) on each SV the function is fixed by f(x^µ) = y^µ and (ii) the distance of the SV with respect to the true boundary fluctuates by a characteristic distance ∆ jointly imply that the fluctuations of f(x^µ), as x^µ evolves along the true decision boundary, must be of the order of unity. This effect is illustrated in figure 4. The characteristic transverse displacement along which these fluctuations decorrelate is simply the distance r_c among SVs, thus

|f(x_⊥ + r_c e_⊥) − f(x_⊥)| ∼ 1,  (22)

where e_⊥ is any unit vector parallel to the plane. Due to these fluctuations, test points inside the band have a finite probability to be incorrectly classified, and at fixed d the test error must be proportional to the fraction ∆/γ of points falling in that band:

ε ∼ ∆/γ.

We now show that from these considerations alone β can be computed. Starting from equation (19), we estimate the gradient of f along the normal direction e_1 at any point x_⊥ on the interface:

∂_{x_1} f(x_⊥) = −Σ_{µ∈Ω_∆} y^µ α_µ ∂_{x_1} ||x_⊥ − x^µ||^ξ ≈ −(p∆/γ) ⟨y^µ α_µ ∂_{x_1} ||x_⊥ − x^µ||^ξ⟩,  (23)

where the sum is over all SVs x^µ, indicated by the set Ω_∆. The sum is replaced by its central-limit-theorem value, valid for large p, and we use that the number of terms in that sum goes as p∆/γ. The average in equation (23) scales as ᾱ∆γ^{ξ−2}, where ᾱ is the mean value of the dual variables α_µ. Imposing that ∆ ∂_{x_1} f(x_⊥) ∼ 1, as follows from equation (20), then leads to our second scaling relation:

p ᾱ ∆³ ∼ γ^{3−ξ}.  (24)

Next we compute the consequences of equation (22), by recasting it in a more suitable format. We define a smoothed version f̃(x_⊥) of f(x_⊥) on a scale r_c:

f̃(x_⊥) = ∫ dx′_⊥ G(x_⊥ − x′_⊥) f(x′_⊥),  (25)

where the function G is defined by its Fourier transform,

G̃(k_⊥) = 1 for ||k_⊥|| ⩽ 1/r_c, G̃(k_⊥) = 0 otherwise.  (26)

Thus f̃(x_⊥) is obtained by removing from f(x_⊥) the Fourier components with ||k_⊥|| > 1/r_c. The constraint of equation (22) is equivalent to imposing that the fluctuations between f and f̃ are of the order of unity. Integrated over space this means that

∫ dx_⊥ [f(x_⊥) − f̃(x_⊥)]² ∼ γ^{d−1},  (27)

which can be Fourier transformed as

∫_{||k_⊥|| > 1/r_c} dk_⊥ |f̂(k_⊥)|² ∼ γ^{d−1}.  (28)

The Fourier transform of the decision function along the transverse components can be computed as

f̂(k_⊥) = −Σ_{µ∈Ω_∆} y^µ α_µ ∫ dx_⊥ e^{−i k_⊥·x_⊥} ||x_⊥ − x^µ||^ξ.  (29)

Using that ||x^µ − x_⊥|| ≈ ||x^µ_⊥ − x_⊥|| and changing variables, one obtains

f̂(k_⊥) = K̃_⊥(k_⊥) Q̂(k_⊥), with Q̂(k_⊥) = −Σ_{µ∈Ω_∆} y^µ α_µ e^{−i k_⊥·x^µ_⊥},  (30)

where we have defined the kernel (transverse) Fourier transform K̃_⊥(k_⊥) and the 'charge' structure factor Q̂(k_⊥). The former can be readily computed for Laplace and Matérn kernels, and at large frequencies it behaves as K̃_⊥(k_⊥) ∼ ||k_⊥||^{−(d−1+ξ)}. Concerning the charge structure factor, for ||k_⊥|| ≫ 1/r_c the phases associated with each term in the sum defining it vary significantly even between neighboring SVs. From a central-limit argument, the factor Q̂ then tends to a random variable with zero mean and variance ᾱ² p∆/γ. This is verified in appendix E.
We can now estimate the integral in equation (28): using |Q̂(k_⊥)|² ∼ ᾱ² p∆/γ for ||k_⊥|| > 1/r_c,

∫_{||k_⊥|| > 1/r_c} dk_⊥ |f̂(k_⊥)|² ∼ ᾱ² (p∆/γ) ∫_{1/r_c}^∞ dk k^{d−2} k^{−2(d−1+ξ)} ∼ ᾱ² (p∆/γ) r_c^{d−1+2ξ}.

The condition equation (28) then leads to the last scaling relation:

ᾱ² p ∆ r_c^{d−1+2ξ} ∼ γ^d, i.e. (using equation (21)) ᾱ r_c^ξ ∼ 1.  (31)

Putting all the scaling relations together we find

∆/γ ∼ p^{−(d−1+ξ)/(3d−3+ξ)}, r_c/γ ∼ p^{−2/(3d−3+ξ)},  (32)

ᾱ ∼ p^{2ξ/(3d−3+ξ)},  (33)

and consequently the asymptotic behavior of the test error is given by

ε ∼ ∆/γ ∼ p^{−β}, β = (d − 1 + ξ)/(3d − 3 + ξ).  (34)

Note 1: The second scaling argument leading to equation (32) can be readily obtained by making a 'minimal-disturbance hypothesis'. Assuming that adding a new training point x* within the domain Ω_∆ will only affect the dual variables of the few closest SVs, the correction of the decision function on the new SV is given by

df(x*) = −Σ_{||x^µ − x*|| ⩽ r_c} dα_µ y^µ ||x* − x^µ||^ξ,  (35)

where dα_µ is the charge correction. One must have Σ_{||x^µ − x*|| ⩽ r_c} dα_µ y^µ ≈ −y* α* to ensure that SVs further away are not affected by this perturbation. Thus dα_µ ∼ α* ∼ ᾱ, where the last equivalence stems from the fact that the added SV is statistically identical to any other one. Finally, requiring that the new point x* must also be an SV implies that the correction represented by equation (35) must be of the order of unity to set |f(x*)| = 1. Hence, we obtain the scaling relation (which implies equation (32) via equations (21) and (24)):

ᾱ r_c^ξ ∼ 1.  (36)

Note 2: The above scaling arguments may also be carried out in the intermediate regime δ ≪ σ < γ. In that case, the kernel equation (18) introduces a cutoff to the volume of interaction in the transverse space. In particular, the number of terms in the sum of equation (23) now goes as (σ/γ)^{d−1} p∆/γ, and the average scales as ᾱ∆σ^{ξ−2}. The discussion of the fluctuations is however unaltered, as r_c ≪ σ by definition. Assembling all the pieces yields the scaling relations

(p/γ^d) ∆ r_c^{d−1} ∼ 1, p ᾱ ∆³ σ^{d−3+ξ} ∼ γ^d, ᾱ r_c^ξ ∼ 1,  (37)

and

ε ∼ ∆/γ ∼ p^{−(d−1+ξ)/(3d−3+ξ)},  (38)

with a prefactor that now depends on σ/γ. Note that when this approach breaks down, namely when σ ∼ r_c, the predictions of the vanishing bandwidth are recovered.
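For convenience, the exponents in p implied by the three scaling relations, as reconstructed above in equations (21), (24) and (31), can be tabulated with a few lines of Python (our sketch; the comments indicate whether each observable decays or grows with p):

```python
def stripe_exponents(d, xi):
    """Power-law exponents in p from the scaling relations (21), (24) and (31)."""
    D3 = 3 * d - 3 + xi
    return {
        "beta":      (d - 1 + xi) / D3,  # test error     ~ p^-beta
        "Delta":     (d - 1 + xi) / D3,  # band thickness ~ p^-...
        "r_c":       2 / D3,             # SV spacing     ~ p^-...
        "alpha_bar": 2 * xi / D3,        # mean charge    ~ p^+... (grows)
    }

for d in (2, 3, 10, 100):
    print(d, stripe_exponents(d, xi=1.0))  # Laplace kernel: beta -> 1/3 as d grows
```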

Multiple interfaces
The scaling analysis considered for the single interface can be directly extended to multiple interfaces. Let us consider the setup of n interfaces separated by a distance w. Because the target function oscillates around the n interfaces, its reproducing kernel Hilbert space (RKHS) norm increases with n, leading to a more and more complicated task. In the limit ∆ ≪ w, the arguments presented between equations (25) and (32), which rely on local considerations, apply identically. The computation of the gradient is more subtle, as the charges will in general differ in magnitude on each side of the interfaces. We discuss in appendix F how the resulting gradient scales with w. In particular, we identify three regimes in the (n, d)-plane, as represented in figure 5. When the dimension is large enough, in the green region, the gradient is dominated by points with large transverse distance, ||x_⊥|| ≫ w. For smaller dimensions, the typical transverse distance decreases so that, in the blue region, the gradient is dominated by points of transverse distance ||x_⊥|| ∼ w. For even smaller dimensions, in the gray region, our description breaks down, because the SVC function is not sufficiently smooth and microscopic effects should be accounted for. The three usual observables ∆, r_c and ᾱ follow the same power laws in p as for the single interface, with w-dependent prefactors characterized in appendix F by an exponent s. The scaling in p is thus unaltered by the presence of multiple interfaces. However, the increasing complexity of the task is reflected in the large prefactor, which requires exponentially more training points to enter the power-law decay as the width w decreases. Note that, for a given dimension, the task complexity, quantified by s(d − 1)/(3d − 3 + ξ), stops increasing once n is large enough to enter the blue region.

Numerical results
In this section, we present the numerical simulations with which we verify the scalings predicted in the two previous sections. Both the single- and the double-interface setups have been considered, with data points sampled from an isotropic Gaussian distribution of variance γ² = 1 along each component. In the single-interface setup the hyperplane is centered at x_1 = 0, while in the double-interface setup one hyperplane is located at x_min = −0.3 and the other at x_max ≈ 1.18549 (see footnote 8). In both setups, the probabilities of positive and negative labels are equal. The margin-SVC algorithm is run using the class svm.SVC from the Python library scikit-learn, which implements a soft-margin algorithm. To recover the hard-margin algorithm presented in section 3.2, the regularization parameter C, which bounds the dual variables from above (see for example chapter 7 of [45]), is set to C = 10^{20}. All results presented in this section have been obtained with the Laplace kernel of bandwidth σ = 100 ≫ γ. Further results with the Matérn kernel are displayed in appendix G.

Footnote 8: The value x_max = √2 erf^{−1}(1 + erf(x_min/√2)) ≈ 1.18549 is chosen in such a way that the expected numbers of y = ±1 points are the same.

The power-law predictions of section 3.4 are verified in figure 6 (for the single interface) and figure 7 (for the double interface). The considered numerical observables are defined as follows: the test error is the fraction of mislabeled points in a test set of size p_test = 10 000; the typical ᾱ is the average SV dual variable; the band thickness ∆ is the average distance of an SV to the closest interface; the procedure to estimate the SV nearest-neighbor scale r_c is described in appendix H. The exponents of the power laws are extracted by fitting the numerical curves in the asymptotic regime and are compared to the theoretical predictions of section 3.4 in figure 8. Note that in large dimensions we observe that the system has not yet fully reached the asymptotic regime in the considered range of training-set sizes p.
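A compact version of this measurement protocol is sketched below (our illustration, with smaller p and C than in the paper). It extracts ∆ and ᾱ from the fitted classifier, whose attributes support_ and dual_coef_ store the SV indices and the products y_µ α_µ. The fitted slopes of log ∆ and log ᾱ versus log p should approach −(d − 1 + ξ)/(3d − 3 + ξ) and +2ξ/(3d − 3 + ξ) respectively, though only at large p.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
d, sigma = 3, 100.0
gram = lambda A, B: np.exp(-np.linalg.norm(A[:, None] - B[None, :], axis=-1) / sigma)

for p in (250, 500, 1000, 2000):
    X = rng.standard_normal((p, d))              # single interface at x_1 = 0
    y = np.sign(X[:, 0])
    svc = SVC(C=1e10, kernel="precomputed").fit(gram(X, X), y)
    Delta = np.abs(X[svc.support_, 0]).mean()    # mean SV distance to the interface
    alpha_bar = np.abs(svc.dual_coef_).mean()    # dual_coef_ holds y_mu * alpha_mu
    print(f"p = {p:5d}: Delta = {Delta:.3f}, alpha_bar = {alpha_bar:.3g}")
```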
We also observe that in the double-interface setup the system only enters the scaling regime when ∆ becomes small enough compared to the distance w between the two hyperplanes, as discussed in section 3.5. The crossover from the interfering-interface regime to the asymptotic regime is illustrated in figure 9. The test error versus ∆ displayed in the left panel for multiple values of w confirms that ε ∼ ∆ when ∆ ≪ w, as expected from the discussion of section 3.5. We show in the right panel that the transition to the asymptotic regime occurs when ∆ ∼ w by rescaling the horizontal axis: ∆ → ∆/w. Because ε ∼ ∆ in the asymptotic regime, it is necessary to also rescale the vertical axis for the curves to collapse, namely ε → ε/w.

Spherical model
We consider a spherical interface separating y = +1 points outside a sphere of radius R from y = −1 points inside. The relevant direction is therefore x_∥ = ||x||, and the label is given by y(x) = sign(||x|| − R). We still assume that the SVs are distributed along the interface, thus forming a shell of radius R and thickness ∆. Once again, the arguments presented between equations (25) and (32), which rely on local considerations, apply identically. Furthermore, we compute in appendix I the gradient ∂f/∂x_∥ and find again the same asymptotic result as for the planar interface, specified in equation (23). Thus our predictions for the spherical model are identical to the ones for the stripe model.
We test these results numerically for a sphere of radius R = √d with a Laplace kernel of bandwidth σ = 100. The results displayed in figures 11 and 12 confirm our analysis. (Figure 12 caption: the exponents are extracted as in figure 11 for the spherical setup; we plot the exponents for the SV band thickness ∆ (left), the SV nearest-neighbor scale r_c (middle) and the SV mean dual variable ᾱ (right) against the dimension d of the data. The black solid line is the prediction of section 3.4, while the dots correspond to the numerical data: blue points for the single-interface setup and orange points for the double-interface setup.)

Improving kernel performance by compressing invariants
In this section, we investigate how compressing the data along the irrelevant directions x ⊥ affects the performance of kernel classification. This analysis is of particular interest for neural networks, where it is now argued (see for instance [5]) that a progressive capability to compress invariants in the data is built up moving through the layers of deep networks.

Stripe model
We consider the stripe model of section 3 with one additional parameter: the amplification factor λ. If the original distribution was characterized by the scales γ_1, …, γ_d along each space direction, we now apply a contraction in the transverse space: γ_i → γ_i/λ for i = 2, …, d. Following the same reasoning as in section 3.4, we can track the effect of the additional amplification parameter. It is not sufficient to merely rescale γ, since the compression is not isotropic. Nevertheless, it is easy to see that the first scaling relation becomes

(p λ^{d−1}/γ^d) ∆ r_c^{d−1} ∼ 1,  (42)

since the density of points inside the SV band is now ∼ pλ^{d−1}/γ^d. Then, for the second scaling relation, we need to rescale the gradient ∂_{x_1} f defined in equation (23). The amplification factors only alter the transverse space: when approximating the average by an integral, the boundaries are rescaled to γ/λ in each transverse direction. The second scaling relation is thus

p ᾱ ∆³ ∼ γ^{3−ξ} λ^{ξ−2}.  (43)

Finally, when imposing that the fluctuations between f and its smoothed version f̃ are of the order of unity, one only needs to update the volume of the transverse space in equation (27): γ^{d−1} → (γ/λ)^{d−1}, which leads to the last scaling relation,

ᾱ r_c^ξ ∼ 1.  (44)

Assembling all the scaling relations yields

ε ∼ ∆/γ ∼ λ^{−2(d−1)/(3d−3+ξ)} p^{−(d−1+ξ)/(3d−3+ξ)}.  (45)

These power laws are assessed numerically for the Laplace kernel (ξ = 1) of bandwidth σ = 100 and a training set of size p = 1000 generated from the Gaussian distribution of variance γ² = 1. Varying the amplification factor over eight orders of magnitude (see figure 13), our predictions hold in a broad range of λ but break down at large and small values, as we now explain.
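The predicted gain is easy to probe numerically. In the sketch below (ours; the values of p, σ and the λ grid are illustrative), the transverse coordinates are divided by λ before training, and the measured test error should decay roughly as λ^{−2(d−1)/(3d−3+ξ)} = λ^{−4/7} for d = 3 and a Laplace kernel, before the crossovers discussed next set in.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
d, p, sigma = 3, 1000, 100.0
gram = lambda A, B: np.exp(-np.linalg.norm(A[:, None] - B[None, :], axis=-1) / sigma)

def stripe_error(lam):
    scale = np.ones(d)
    scale[1:] = 1.0 / lam                       # compress the transverse directions
    X = rng.standard_normal((p, d)) * scale
    y = np.sign(X[:, 0])                        # labels depend on x_1 only
    svc = SVC(C=1e10, kernel="precomputed").fit(gram(X, X), y)
    Xt = rng.standard_normal((2000, d)) * scale
    return np.mean(svc.predict(gram(Xt, X)) != np.sign(Xt[:, 0]))

for lam in (1, 4, 16, 64):
    err = np.mean([stripe_error(lam) for _ in range(3)])
    print(f"lambda = {lam:3d}: test error = {err:.4f}")
# expected decay ~ lam^{-2(d-1)/(3d-3+xi)} = lam^{-4/7} for d = 3, xi = 1
```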
In the limit λ → 0, the relevant direction x_1 is negligibly small compared to the other directions; information is thus suppressed, and points are classified at random: the test error goes to 1/2. Furthermore, all training points must be SVs, and indeed ∆ → ⟨|x_1|⟩_{x_1∼N(0,1)} = √(2/π) (which is the average distance from any point in the dataset to the interface) and ᾱ → 1.
In the opposite limit λ → ∞, the setup effectively lives in dimension one (seeing only x_1), and all curves converge independently of the space dimension d. These relations allow us to identify a critical scale λ_c at which the multidimensional system reduces effectively to a one-dimensional system. It occurs when the test error of the compressed multidimensional kernel is equal to the test error of the kernel that only sees the component x_1.
Using our scalings, together with the one-dimensional exponent β(d = 1) = 1, we find

λ_c^{−2(d−1)/(3d−3+ξ)} p^{−(d−1+ξ)/(3d−3+ξ)} ∼ p^{−1}, i.e. λ_c ∼ p.

Cylinder model
We now consider a cylinder model, in which a point x has a positive label if ||x_∥|| > R ∼ γ and a negative label otherwise. Such a model is also characterized by the asymptotic scalings in p specified in equations (32)-(34). As in the previous section, we compress the perpendicular directions by the amplification factor λ: γ_i → γ_i/λ along each transverse direction. The derivations of the scaling relations equations (42) and (44) hold equally. However, the scaling relation equation (43) is now independent of the amplification factor: the characteristic size of the transverse space occurring in the gradient integral equation (23) remains of the order of the system size γ. Assembling the different scalings yields

ε ∼ λ^{−ξ(d−1)/(3d−3+ξ)} p^{−(d−1+ξ)/(3d−3+ξ)}.  (46)

Conclusion
We have studied the learning curve exponent β of isotropic kernels in the presence of invariants, improving on worst-case bounds previously obtained in the literature. For regression on Gaussian fields, we find that invariants do not increase β, which behaves as ∼ 1/d in large dimension: methods based on isotropic kernels suffer from the curse of dimensionality, as already argued in [4]. Our analysis also suggests a simple estimate, equation (8), for the performance of regression beyond the Gaussian fields considered here. For binary classification and simple models of invariants, we find the opposite result. For a planar interface separating labels, β ⩾ 1/3 for all dimensions, improving on previous bounds. Note that the striking difference between classification and regression does not stem from the distinct models considered in each case. Indeed, following equation (8), we expect that performing mean-squared ridgeless regression on the stripe model leads to the curse of dimensionality with β = 1/d, as we have checked on a few examples (data not shown). In the classification problem instead, due to the fact that only a tiny band of data are SVs, the output function ends up being much smoother (i.e. with more rapidly decaying Fourier components) than a step function, leading to better performance.
This success of classification holds when several interfaces are present, or in the spherical case where the interface bends continuously. Thus, isotropic kernels can beat the curse of dimensionality even for non-planar boundaries between labels. For which class of boundaries does this result hold? The geometry of the spatial distribution of SVs suggests an intuitive answer. The curse of dimensionality is beaten because a very narrow layer of width ∆ (i.e. rapidly decaying with p) is sufficient to fit all data, despite the fact that the distance r_c between SVs is much larger (and indeed subject to the curse of dimensionality). Thus, if the boundary displays significant variations below the scale r_c, it presumably cannot be detected by isotropic kernels. In this view, beating the curse of dimensionality is only possible if the boundary becomes more and more regular as the dimension increases. This geometrical view is consistent with the more abstract kernel literature, in which the curse is lifted if labels correspond to the sign of a regular function (in the sense of belonging to the RKHS of the kernel) [36]. Empirically, sufficient regularity may be achieved in practical settings at least along some invariants, such as completely uninformative pixels near the boundaries of images. Under which conditions other invariants, e.g. related to translation, can be exploited by isotropic kernels remains to be understood.
Note added: In [46], these results were extended beyond kernels, to the case of a wide one-hidden layer net. In the lazy training regime, results are identical to those presented here, but more favorable exponents β are found in the feature learning regime.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://gitlab.com/jonas.paccolat/svc-for-simple-invariants.

Appendix A. Kernel regression with invariant dimensions
Theorem. Let K_T(x) and K_S(x) be two translation-invariant kernels (called the teacher and student respectively) defined on V_d ≡ R^d, and let K̃_T(w) and K̃_S(w) be their Fourier transforms in V_d. Assume that:
• K_T(x), K_S(x) are continuous everywhere and differentiable everywhere except at the origin x = 0;
• K_T(x) and K_S(x) are positive definite and isotropic, that is, they only depend on ||x||;
• K_T(x) and K_S(x) have a cusp at the origin, and their d-dimensional Fourier transforms decay at high frequencies with dimension-dependent exponents α_T(d) and α_S(d), respectively (we will evaluate them at d_∥ for the teacher and at d for the student).
Assume furthermore that the teacher kernel lives in a reduced space of dimension d_∥ ⩽ d, in the sense that

K_T(x) = K_T(x_∥), with x = (x_∥, x_⊥) and x_∥ ∈ R^{d_∥}.

We use the teacher kernel to sample a Gaussian random field Z_T(x) ∼ N(0, K_T) at points that lie on a d-dimensional regular lattice in V_d with fixed spacing δ, and we use the student kernel to infer Ẑ_S(x) at a new point x ∈ V_d via regression; performance is then evaluated by computing the expected mean-squared error ε on points independent from those used for training. Then, as δ → 0,

ε ∼ δ^{min(α_T(d_∥) − d_∥, 2α_S(d))}.

(i) Setting. We first consider a finite number of points p in a box V_d = [−L/2, L/2]^d and then take the limit p, L → ∞, keeping the spacing δ = Lp^{−1/d} fixed. Regression is done by minimizing the mean-squared error on the p points:

Ẑ_S = argmin Σ_{µ=1}^p [Ẑ(x^µ) − Z_T(x^µ)]², with Ẑ(x) = Σ_µ a_µ K_S(x − x^µ),

and the generalization error is defined as

ε = E_T (1/L^d) ∫_{V_d} dx [Ẑ_S(x) − Z_T(x)]².

(The expectation value is taken with respect to the teacher random process.) Given a function F(x) on the d-dimensional box V_d = [−L/2, L/2]^d, we denote its Fourier transform (series) and antitransform by

F̃(w) = (1/L^d) ∫_{V_d} dx F(x) e^{−iw·x}, F(x) = Σ_{w∈L_d} F̃(w) e^{iw·x}, L_d ≡ (2π/L) Z^d.

Given the structure of the teacher kernel we can write

K̃_T(w) = K̃_T^∥(w_∥) δ_{w_⊥}, w = (w_∥, w_⊥).

This formula states that the Fourier transform of the teacher kernel has frequencies that also live in the corresponding d_∥-dimensional subspace in the frequency domain. The term δ_{w_⊥} is a discrete delta (not a Dirac delta): this will be important later because it implies that it is scale invariant, δ_{aw_⊥} = δ_{w_⊥}. The first factor, that is the Fourier transform of the teacher kernel restricted to the d_∥-dimensional space, decays at large frequencies with an exponent α_T(d_∥) that depends on the intrinsic dimension d_∥:

K̃_T^∥(w_∥) ∼ ||w_∥||^{−α_T(d_∥)}.

(ii) Closed-form solution. The solution to the regression problem can be computed in closed form:

Ẑ_S(x) = Σ_{µ,ν=1}^p K_S(x − x^µ) (K_S^{−1})_{µν} Z_T(x^ν),

where Z_T = (Z_T(x^µ))_{µ=1}^p are the training data (the points x^µ lie on the regular lattice) and (K_S)_{µν} ≡ K_S(x^µ − x^ν) is the Gram matrix, which is invertible since the kernel K_S is assumed to be positive definite. In Fourier space, the components of Ẑ_S read

Ẑ_S(w) = K̃_S(w) Z_T^⋆(w̄)/K̃_S^⋆(w̄), w̄ ∈ B_d, w ∈ w̄ + (2π/δ) Z^d,

where we have defined F^⋆(w) ≡ Σ_{n∈Z^d} F̃(w + 2πn/δ) for a generic function F. The mean-squared error can then be written using the Parseval-Plancherel identity. After some calculations we find

ε = Σ_{w∈L_d∩B_d} { K̃_T^⋆(w) − 2 [K̃_S K̃_T]^⋆(w)/K̃_S^⋆(w) + K̃_T^⋆(w) [K̃_S²]^⋆(w)/[K̃_S^⋆(w)]² },  (A10)

where L_d = (2π/L) Z^d and B_d = [−π/δ, π/δ]^d is the Brillouin zone.
In order to simplify this expression in the case where d_∥ ⩽ d, let us also introduce

K̃_T^{∥⋆}(w_∥) ≡ Σ_{n_∥∈Z^{d_∥}} K̃_T^∥(w_∥ + 2πn_∥/δ).

Plugging the last two equations into equation (A10) we see that, because of the terms δ_{w_⊥}, only frequencies of the form w = (w_∥, 0) contribute, and we have

ε = Σ_{w_∥∈L_{d_∥}∩B_{d_∥}} { K̃_T^{∥⋆}(w_∥) − 2 [K̃_S K̃_T]^⋆(w_∥, 0)/K̃_S^⋆(w_∥, 0) + K̃_T^{∥⋆}(w_∥) [K̃_S²]^⋆(w_∥, 0)/[K̃_S^⋆(w_∥, 0)]² }.

Notice that K̃_S^⋆ and [K̃_S²]^⋆ do not turn into [K̃_S]^⋆_∥ and [K̃_S²]^⋆_∥: this is because the student kernel does not have the same invariants as the teacher, and it depends on all the components.

(iii) Expansion.
Using the high-frequency behavior of the Fourier transforms of the two kernels we can write

K̃_T^∥(w_∥) = ||w_∥||^{−α_T(d_∥)} ψ^∥_{α_T(d_∥)}(w_∥δ), K̃_S(w) = ||w||^{−α_S(d)} ψ_{α_S}(wδ).

We have introduced the scaling functions ψ_α; in particular, ψ^∥_{α_T(d_∥)}(0) and ψ_{α_S}(0) are finite. Furthermore, the w_∥'s in the sums are at most of order δ^{−1}, therefore the terms ψ_α(wδ) are of order δ^0 and do not influence how equation (A10) scales with δ.
Expanding equation (A10) and keeping only the highest orders, we find that ε reduces to the sum of a teacher contribution of order δ^{α_T(d_∥)} and a student contribution of order δ^{2α_S}, each multiplied by a sum over w_∥ ∈ L_∥ ∩ B_∥ of the corresponding scaling functions (equation (A20)). We have neglected terms proportional to, for instance, δ^{α_T(d_∥)+α_S}, since they are subleading with respect to δ^{α_T(d_∥)}, but we must keep both δ^{α_T(d_∥)} and δ^{2α_S} since we do not know a priori which one is dominant. The additional term δ^{−d} in the subleading terms comes from the fact that |L ∩ B| ∼ δ^{−d}.

The first term in equation (A20) is the simplest to deal with: since ||w_∥δ|| is smaller than some constant for all w_∥ ∈ L_∥ ∩ B_∥, and the function ψ^∥_{α_T(d_∥)}(w_∥δ) has a finite limit, this contribution scales as δ^{α_T(d_∥)−d_∥}. We then split the second term in equation (A20) into two contributions.
Small ||w_∥||. We consider 'small' all the terms w_∥ ∈ L_∥ ∩ B_∥ such that ||w_∥|| < Γ, where Γ ≫ 1 is of order δ^0 but large. As δ → 0, ψ_{2α_S}(w_∥δ) → ψ_{2α_S}(0), which is finite because K_S(0) < ∞; this contribution is therefore proportional to δ^{2α_S}. The summand is real and strictly positive, because the positive definiteness of the kernels implies that their Fourier transforms are strictly positive. Moreover, as δ → 0 the set of small frequencies converges to {w_∥ ∈ L_∥ : ||w_∥|| < Γ}, which contains a finite number of elements, independent of δ. Therefore the small-frequency part of the second term scales as δ^{2α_S}.

Large ||w_∥||. 'Large' w_∥ are those with ||w_∥|| > Γ; we recall that Γ ≫ 1 is of order δ^0 but large. This allows us to approximate K̃_T^∥ and K̃_S in the sum by their asymptotic behavior, and the resulting contribution is at most of order δ^{min(α_T(d_∥)−d_∥, 2α_S)}. Therefore, in the end,

ε ∼ δ^{min(α_T(d_∥)−d_∥, 2α_S(d))}.

The kernels K that we consider in the present article, namely Laplace and Matérn, share the property that the respective exponents take the form α_K(d) = d + θ_K, with θ_K a dimension-independent constant that only depends on the isotropic function defining the kernel. For instance, we have α(d) = d + 1 for Laplace and α(d) = d + 2ν for Matérn (with parameter ν). Consequently, for these kernels the term α(d_∥) − d_∥ that appears in the last equation is actually independent of d_∥, and therefore so is the exponent β. We believe that this structure of the exponent α(d) is more general. Signals pointing in this direction can be found in several papers. In [47] it is shown that (with our notation), for functions K(||x||) that are integrable in R^d and R^{d+2}, the Fourier transforms in d and d + 2 dimensions are related by

K̃_{d+2}(w) = −(2π/w) ∂_w K̃_d(w),

so that if K̃_d(w) ∼ w^{−α(d)} at high frequency, then α(d + 2) = α(d) + 2. These results offer a link between the exponents in different dimensions. In [49] the author computes the asymptotic behavior of the one-dimensional Fourier transform of functions with a singularity. In particular, it follows that if K(x) = |x|^{θ_K} K_∞(x), with −1 < θ_K ⩽ 0 and K_∞ ∈ C^∞(R), then its Fourier transform at leading order decays with an exponent α(d = 1) = 1 + θ_K. There is a similarity with the value of the exponents for the Laplace and Matérn kernels that we use: the value of θ_K is linked to the exponent of the cusp |x|^{θ_K} that appears in the expansion of the kernel at the origin. We expect that this fact, namely that the exponent α_K(d) is the sum of the spatial dimension d and of the cusp exponent θ_K, is more generic and applies to most of the kernels that are used in practice.

Appendix B. Regime σ ≪ δ: curse of dimensionality
We consider here the case where the kernel bandwidth σ is much smaller than the nearest-neighbor distance δ. In this limit, the contributions in the expansion of the decision function in equation (12) are significantly suppressed, because the kernel is supposed to decay when its argument is large, and the decision function is dominated by the charge of the training pattern x^µ that is closest to x. The sign of the decision function is thus fixed by the sign of the nearest neighbor's charge, and the accuracy is driven by the nearest-neighbor distance, which is susceptible to the curse of dimensionality.
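This nearest-neighbor behavior can be checked directly. In the sketch below (ours, not from the paper), the training labels are made exactly balanced so that the bias b = ⟨y⟩ vanishes and the nearest-neighbor term dominates, as discussed in the analysis that follows; the small-bandwidth SVC should then agree with a 1-nearest-neighbor classifier on the vast majority of test points.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(6)
d, p = 3, 500
X = rng.standard_normal((p, d))
signs = np.tile([1.0, -1.0], p // 2)
X[:, 0] = signs * np.abs(X[:, 0])   # exactly balanced labels, so b = <y> = 0
y = signs

gram = lambda A, B, s: np.exp(-np.linalg.norm(A[:, None] - B[None, :], axis=-1) / s)

sigma = 0.02                        # well below delta ~ p^(-1/d) ~ 0.13
svc = SVC(C=1e10, kernel="precomputed").fit(gram(X, X, sigma), y)
nn = KNeighborsClassifier(n_neighbors=1).fit(X, y)

Xt = rng.standard_normal((2000, d))
agree = np.mean(svc.predict(gram(Xt, X, sigma)) == nn.predict(Xt))
print("agreement with 1-NN:", agree)
```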
We can see this more precisely if we approximate the kernel interaction between two points x and x′ as

K(||x − x′||/σ) ≈ a_0 if x′ = x; a_1 if x′ is one of the nearest neighbors of x; 0 otherwise,  (B1)

where a_0 = K(0) and a_1 is the typical kernel value between nearest neighbors, with a_1 ≪ a_0 in this regime. Hence, the decision function at a point x^µ reads

f(x^µ) = a_0 α_µ y^µ + a_1 Σ_{ν∈NN(µ)} α_ν y^ν + b ≈ (a_0 + a′_1) α_0 y^µ + b,  (B2)

where the sum runs over the nearest neighbors of x^µ. We use that all points are SVs, which results from the hierarchy a_1 ≪ a_0: indeed, the interaction term alone is never sufficient for |f(x^µ)| to exceed one. The second equality in equation (B2) is justified by the following reasoning. First, in the limit δ → 0, the nearest neighbors typically share the same sign, so that all the y^ν in the sum can be replaced by y^µ; a′_1 is thus a_1 multiplied by the number of terms in the sum. Then, because the distribution is assumed smooth and the kernel is blind to the data structure coming from distant patterns, the SV charge may only depend on its label: α_µ = α_0 + y^µ ∆α. ∆α is taken independent of the associated label y^µ, as we assume the labels to be balanced. The charge conservation equation (16) implies immediately that ∆α = −α_0⟨y⟩, where ⟨y⟩ = (1/p) Σ_µ y^µ ∼ p^{−1/2}, and imposing the condition y^µ f(x^µ) = 1 on each point x^µ yields α_0 = 1/(a_0 + a′_1) and b = ⟨y⟩. We can now compute the test error of the SVC in the limit σ ≪ δ. The prediction on a test point x is

ŷ(x) = sign( a_1 Σ_{ν∈NN(x)} α_ν y^ν + b ) = sign( a′_1 α_0 y_NN + ⟨y⟩ ),

where with a slight abuse of notation we take the sum over the points x^ν in the training set that are nearest neighbors of the test point x, and y_NN is their label (as before, assumed to be constant among nearest neighbors). We observe two distinct behaviors according to the ratio between the bias b = ⟨y⟩ and the nearest-neighbor contribution a′_1. If ⟨y⟩ ∼ p^{−1/2} is much larger than a′_1, the above prediction yields ŷ(x) = sign⟨y⟩ (for any x): this estimator cannot beat a 50% accuracy. In contrast, if ⟨y⟩ is much smaller than a′_1, the prediction yields ŷ(x) = sign(y_NN): the classifier acts as a nearest-neighbor algorithm, and consequently its test error scales as the nearest-neighbor distance, ε ∼ δ ∼ p^{−1/d}, namely it is susceptible to the curse of dimensionality, as we show in figure B1.

Appendix I. Gradient of the decision function for the spherical model

As for the linear interface, the first scaling relation stems from the condition ∆ · ∂_{x_∥} f(x⋆) ∼ 1, for any x⋆ lying on the spherical interface. According to the change of frame introduced above, the relevant direction corresponds to the first coordinate, namely x_∥ = x_1. The gradient expression (23) can thus be expressed as an integral in spherical coordinates with north pole x⋆ = (R, 0):

∂_{x_∥} f(x⋆) ∼ S_{d−2} ∫_{−∆}^{∆} du (R + u)^{d−1} ∫_0^π dϕ sin^{d−2}ϕ ρ(R + u) α(R + u) y(R + u) I(u, ϕ),

where the norm of the integration vector is r = R + u and its angle with respect to the north pole is ϕ. All other angles simply integrate to the surface of the (d − 2)-sphere, S_{d−2}, since they do not contribute to the integrand. Expanding the kernel factor to linear order in u, I(u, ϕ) ≈ a_0(ϕ) + a_1(ϕ) u, with

a_0(ϕ) = (1/2R) [2R²(1 − cos ϕ)]^{ξ/2} and a_1(ϕ) = [1 − (ξ/2)(1 − cos ϕ)] [2R²(1 − cos ϕ)]^{ξ/2−1}.

The leading-order contribution a_0 vanishes because of the charge conservation (equation (I2)), so the gradient reads

∂_{x_∥} f(x⋆) ∼ p ᾱ ∆²,

and the second scaling relation p ᾱ ∆³ ∼ 1 is identical to the one of the stripe model. Since the other relations are obtained from local arguments, they are independent of the global shape of the classification task.