Locality defeats the curse of dimensionality in convolutional teacher-student scenarios

Convolutional neural networks perform a local and translationally-invariant treatment of the data: quantifying which of these two aspects is central to their success remains a challenge. We study this problem within a teacher-student framework for kernel regression, using `convolutional' kernels inspired by the neural tangent kernel of simple convolutional architectures of given filter size. Using heuristic methods from physics, we find in the ridgeless case that locality is key in determining the learning curve exponent $\beta$ (that relates the test error $\epsilon_t\sim P^{-\beta}$ to the size of the training set $P$), whereas translational invariance is not. In particular, if the filter size of the teacher $t$ is smaller than that of the student $s$, $\beta$ is a function of $s$ only and does not depend on the input dimension. We confirm our predictions on $\beta$ empirically. We conclude by proving, using a natural universality assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.


Introduction
Deep Convolutional Neural Networks (CNNs) are widely recognised as the engine of the latest successes of deep learning methods, yet such success is surprising. Indeed, any supervised learning model suffers in principle from the curse of dimensionality: under minimal assumptions on the function to be learnt, achieving a fixed target generalisation error requires a number of training samples $P$ which grows exponentially with the dimensionality $d$ of the input data [1], i.e. $\epsilon(P)\sim P^{-1/d}$. Nonetheless, empirical evidence shows that the curse of dimensionality is beaten in practice [2,3,4], with
$$\epsilon(P)\sim P^{-\beta}, \qquad \beta\gg 1/d. \tag{1}$$
CNNs, in particular, achieve excellent performance on high-dimensional tasks such as image classification on ImageNet with state-of-the-art architectures, for which $\beta\in[0.3, 0.5]$ [2]. Natural data must then possess additional structure that makes them learnable. A classical idea [5] ascribes the success of recognition systems to the compositionality of data, i.e. the fact that objects are made of features, themselves made of sub-features [6,7,8]. In this view, the locality of CNNs plays a key role in their performance, as supported by empirical observations [9]. Yet, there is no clear analytical understanding of the relationship between the compositionality of the data and learning curves.
In order to study this relationship quantitatively, we introduce a teacher-student framework for kernel regression, where the function to be learnt takes one of the following two forms:
$$f^{LC}(x) = \sum_{i\in\mathcal{P}} g_i(x_i), \qquad f^{CN}(x) = \sum_{i\in\mathcal{P}} g(x_i). \tag{2}$$
Here, $x$ is a $d$-dimensional input and $x_i$ denotes the $i$-th $t$-dimensional patch of $x$, $x_i = (x_i, \dots, x_{i+t-1})$, with $i$ ranging in a subset $\mathcal{P}$ of $\{1, \dots, d\}$. The $g_i$'s and $g$ are random functions of $t$ variables whose smoothness is controlled by an exponent $\alpha_t$. Such functions model the local nature of certain datasets and can be generated, for example, by randomly-initialised one-hidden-layer neural networks: $f^{LC}$ corresponds to a locally connected network (LCN) [10,11], in which the input is split into lower-dimensional patches before being processed, whereas a network enforcing invariance with respect to shifts of the input patches via weight sharing can be described by $f^{CN}$. In such cases $t$ would be the filter size of the network. Our goal is to compute the asymptotic decay of the error of a student kernel performing regression on such data, and to relate the corresponding exponent $\beta$ to the locality of the target function. The student kernel corresponds to a prior on the true function of the form described by Eq. (2), except that the filter size $s$ and the prior $\alpha_s$ on the smoothness of the $g$ functions can differ from those of the target function. Such students include overparametrised one-hidden-layer neural networks operating in the lazy training regime [12,13,14,15,16].

Our contributions
We consider a teacher-student framework for kernel regression, where the target function has one of the forms in Eq. (2), with the $g_i$'s and $g$ being Gaussian random fields of given covariance. Target functions are characterised by the dimensionality $t$ of the $g$ functions (the filter size) and a smoothness exponent $\alpha_t$, such that $\alpha_t > 2n$ implies that typical target functions are at least $n$ times differentiable. Kernel regression is performed by local or convolutional student kernels, having filter size $s$ and a prior on the target smoothness characterised by another exponent $\alpha_s > 0$. Our main contributions follow:
• We use recent results based on the replica method of statistical physics on the generalisation error of kernel methods [17,18,19] to estimate the exponent $\beta$. We find that $\beta = \alpha_t/s$ if $t \leq s$ and $\alpha_t \leq 2(\alpha_s + s)$. This approach is non-rigorous, but it can be proven if data are sampled on a lattice [4] and it corresponds to a provable lower bound on the error when teacher and student are equal [20].
• In particular, we find the same exponent for students with a prior on the shift invariance of the target function and students without this prior, implying that the curse of dimensionality is beaten due to locality and not shift invariance.
• We confirm our predictions systematically by performing kernel ridgeless regression numerically for various $t$, $s$ and embedding dimensions $d$.
• We use the recent framework of [21] and a natural Gaussian universality assumption to prove a rigorous estimate of $\beta$ in the case where the ridge decreases with the size of the training set. The estimate of $\beta$ again depends on $s$ and not on $d$, demonstrating that the curse of dimensionality can indeed be beaten by using local filters on such compositional data.

Related work
Several recent works study the role of the compositional structure of data [6,22,23]. When such structure is hierarchical, deep convolutional networks can be much more expressive than shallow ones [6,24,7]. Concerning training, [25] shows that both convolutional and locally-connected networks can achieve a target generalisation error in polynomial time, whereas fully-connected networks cannot, for a class of functions which depend only on $s$ consecutive bits of the $d$-dimensional input, with $s = O(\log d)$. In [8] the effects of the architecture's locality are studied from a kernel perspective, using a class of deep convolutional kernels introduced in [26,27] and characterising their Reproducing Kernel Hilbert Space (RKHS). In general, belonging to the RKHS ensures favourable bounds on performance and, for isotropic kernels, is a constraint on the function smoothness that becomes stringent in large $d$. For local functions, the corresponding constraint on smoothness is governed by the filter size $s$ and not by $d$ [8]. Lastly, a recent work shows that weight sharing, in the absence of locality, leads to a mild improvement of the generalisation error of shift-invariant kernels [28].
By contrast, our work focuses on computing non-trivial learning curve exponents in a setup where the locality and shift-invariance priors of the kernel can differ from those of the class of functions being learnt. In our setup, the latter are in general not in the RKHS of the kernel. Technically, our result that the size of the student filter $s$ controls the learning curve (and not that of the teacher $t$) relates to the fact that kernels are not able to detect data anisotropy (the fact that the function depends only on a subset of the coordinates) in worst-case settings [30] nor in the typical case for Gaussian fields [31].

Setup
Kernel ridge regression Kernel ridge regression is a method to learn a target function $f^*: \mathbb{R}^d \to \mathbb{R}$ from $P$ observations $\{(x^\mu, f^*(x^\mu))\}_{\mu=1}^P$, where the inputs $x^\mu$ are i.i.d. random variables distributed according to a measure $p^{(d)}(d^dx)$ on $\mathbb{R}^d$. Let $K$ be a positive-definite kernel and $\mathcal{H}$ the corresponding Reproducing Kernel Hilbert Space (RKHS). The kernel ridge regression estimator $\hat{f}$ of the target function $f^*$ is defined as
$$\hat{f} = \underset{f\in\mathcal{H}}{\operatorname{argmin}} \left[ \sum_{\mu=1}^{P} \left( f(x^\mu) - f^*(x^\mu) \right)^2 + \lambda \lVert f \rVert_{\mathcal{H}}^2 \right], \tag{3}$$
where $\lVert\cdot\rVert_{\mathcal{H}}$ denotes the RKHS norm and $\lambda$ is the ridge parameter. The limit $\lambda \to 0^+$ is known as the ridgeless case and corresponds to the solution with minimum RKHS norm that interpolates the $P$ observations. Eq. (3) is a convex optimisation problem, having the unique solution
$$\hat{f}(x) = \sum_{\mu,\nu=1}^{P} K(x, x^\mu) \left[ (K_P + \lambda I_P)^{-1} \right]_{\mu\nu} f^*(x^\nu), \tag{4}$$
where $K_P$ is the Gram matrix defined as $(K_P)_{\mu\nu} = K(x^\mu, x^\nu)$, and $I_P$ denotes the $P$-dimensional identity matrix. Our goal is to compute the generalisation error, which we define as the expectation of the mean squared error over the data distribution $p^{(d)}(d^dx)$, averaged over an ensemble of target functions $f^*$, i.e.
$$\epsilon(P) = \mathbb{E}_{f^*}\left[ \int p^{(d)}(d^dx) \left( \hat{f}(x) - f^*(x) \right)^2 \right]. \tag{5}$$
The error depends on the number of samples $P$ through the predictor of Eq. (4), and we refer to the graph of $\epsilon(P)$ as the learning curve.
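To make Eqs. (3) and (4) concrete, here is a minimal numerical sketch of kernel ridge regression in numpy. The Laplacian kernel and all sizes are illustrative choices rather than the paper's experimental setup, and a small finite ridge stands in for the ridgeless limit:

```python
import numpy as np

def laplacian_kernel(X, Y):
    # An illustrative positive-definite kernel, K(x, y) = exp(-||x - y||).
    d2 = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.sqrt(np.maximum(d2, 0.0)))

def krr_predict(X_train, y_train, X_test, kernel, ridge=1e-8):
    # Eq. (4): f_hat(x) = sum_{mu,nu} K(x, x^mu) [(K_P + lambda I_P)^{-1}]_{mu nu} y^nu.
    # ridge -> 0 recovers the minimum-RKHS-norm interpolant (ridgeless case).
    K_P = kernel(X_train, X_train)
    alpha = np.linalg.solve(K_P + ridge * np.eye(len(y_train)), y_train)
    return kernel(X_test, X_train) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = np.sin(2 * np.pi * X[:, 0])            # a toy target function
X_new = rng.uniform(size=(50, 3))
pred = krr_predict(X, y, X_new, laplacian_kernel)
```

With a near-zero ridge the estimator interpolates the training observations, consistently with the ridgeless limit described above.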
Statistical mechanics of generalisation in kernel regression The theoretical understanding of generalisation is still an open problem. A few recent works [17,21,18] relate the generalisation error to the decomposition of the target function in the eigenbasis of the kernel. A positive-definite kernel $K$ can indeed be written, by Mercer's theorem, in terms of its eigenvalues $\{\lambda_\rho\}$ and eigenfunctions $\{\phi_\rho\}$:
$$K(x, y) = \sum_{\rho\geq 1} \lambda_\rho\, \phi_\rho(x)\, \phi_\rho(y). \tag{6}$$
In [17,21,18] it is shown that, when the target function can be written in terms of the kernel eigenbasis, the error can also be cast as a sum of modal contributions, $\epsilon = \sum_\rho \epsilon_\rho$. The details of the general formulation are summarised in Appendix A. Here we present an intuitive limiting case, obtained in the ridgeless limit $\lambda \to 0^+$, when $\lambda_\rho \sim \rho^{-a}$ for large $\rho$ and the variance $\mathbb{E}[|c_\rho|^2]$ of the target coefficients decays as a power of $\rho$:
$$\epsilon(P) \sim \sum_{\rho > P} \mathbb{E}[|c_\rho|^2] \equiv B(P), \tag{8}$$
with $\sim$ denoting asymptotic equivalence for large $P$. Eq. (8) indicates that, given $P$ examples, the generalisation error can be estimated as the tail sum of the power in the target function past the first $P$ modes of the kernel, which we denote as $B(P)$. Although the general modal decomposition cannot be proven rigorously in the ridgeless limit [21,19], additional results are available when the target functions are Gaussian random fields with covariance specified by a teacher kernel:
• Eq. (8) can be proven rigorously [4] if teacher and student are isotropic kernels and the input points $x^\mu$ are sampled on the lattice $\mathbb{Z}^d$, i.e. all the elements of each input sequence are integer multiples of an arbitrary unit;
• If teacher and student coincide, then $\mathbb{E}[|c_\rho|^2]$ equals the $\rho$-th eigenvalue $\lambda_\rho$ and (see e.g. [20]) $\epsilon(P) \geq B(P)$, i.e. the estimate of Eq. (8) is a lower bound.
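The tail-sum estimate of Eq. (8) is easy to illustrate numerically for a hypothetical power-law target spectrum (not one derived from a specific kernel): with $\mathbb{E}[|c_\rho|^2] = \rho^{-b}$, the tail sum $B(P)$ decays as $P^{1-b}$.

```python
import numpy as np

def tail_sum(P, b, n_modes=10**6):
    # B(P): power in the target past the first P kernel modes, for the
    # (assumed) power-law spectrum E[|c_rho|^2] = rho^{-b}.
    rho = np.arange(1, n_modes + 1)
    return (rho ** (-b))[P:].sum()

b = 2.0
Ps = np.array([100, 200, 400, 800])
Bs = np.array([tail_sum(P, b) for P in Ps])
# log-log slope of B(P) vs P: should approach 1 - b = -1 for b = 2
slope = np.polyfit(np.log(Ps), np.log(Bs), 1)[0]
```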

Kernels for local and convolutional teacher-student scenarios
In this section we introduce convolutional and local kernels that will be used as teachers, i.e. to generate different ensembles of target functions $f^*$ with controlled smoothness and degree of locality, and as student kernels. We motivate our choice by considering one-hidden-layer neural networks with simple local and convolutional architectures. Because of the relationship between our kernels and the Neural Tangent Kernel [12] of the aforementioned architectures, our framework encompasses regression with simple overparametrised networks trained in the lazy regime [16]. For the sake of clarity we limit the discussion to inputs which are sequences in $\mathbb{R}^d$, i.e. $x = (x_1, \dots, x_d)$. The extension to higher-order tensorial inputs such as images $X \in \mathbb{R}^{d\times d}$ is straightforward. To avoid dealing with the boundaries of the sequence we identify $x_{i+d}$ with $x_i$ for all $i = 1, \dots, d$.
Definition 3.1 (one-hidden-layer CNN). A one-hidden-layer convolutional network with $H$ hidden neurons and average pooling is defined as
$$f^{CN}(x) = \frac{1}{\sqrt{H}} \sum_{h=1}^{H} a_h \frac{1}{|\mathcal{P}|} \sum_{i\in\mathcal{P}} \sigma\left( w_h \cdot x_i \right),$$
where $x \in \mathbb{R}^d$ is the input, $H$ is the width, $\sigma$ a nonlinear activation function, $\mathcal{P} \subseteq \{1, \dots, d\}$ is a set of patch indices and $|\mathcal{P}|$ its cardinality. For all $i \in \mathcal{P}$, $x_i$ is an $s$-dimensional patch of $x$. For all $h = 1, \dots, H$, $w_h \in \mathbb{R}^s$ is a filter with filter size $s$ and $a_h \in \mathbb{R}$ is a scalar weight. The dot $\cdot$ denotes the standard Euclidean scalar product.
In the network defined above, a $d$-dimensional input sequence $x$ is first mapped to $s$-dimensional patches $x_i$, which are ordered subsequences of the input. Comparing each patch to a filter $w_h$ and applying the activation function $\sigma$ leads to a $|\mathcal{P}|$-dimensional hidden representation which is equivariant under shifts of the input. The summation over the patch index $i$ promotes this equivariance to full invariance, leading to a model which is both local and shift-invariant, as $f^{CN}$ in Eq. (2). A model which is only local, as $f^{LC}$ in Eq. (2), can be obtained by lifting the constraint of weight sharing, which forces, for each $h = 1, \dots, H$, the same filter $w_h$ to apply to all patches $x_i$.
Definition 3.2 (one-hidden-layer LCN). In the notation of Definition 3.1, a one-hidden-layer locally-connected network with $H$ hidden neurons is defined as
$$f^{LC}(x) = \frac{1}{\sqrt{H\,|\mathcal{P}|}} \sum_{h=1}^{H} \sum_{i\in\mathcal{P}} a_{h,i}\, \sigma\left( w_{h,i} \cdot x_i \right),$$
where, for all $i \in \mathcal{P}$ and $h = 1, \dots, H$, $x_i$ is an $s$-dimensional patch of $x$, $w_{h,i} \in \mathbb{R}^s$ is a filter with filter size $s$, and $a_{h,i} \in \mathbb{R}$ is a scalar weight.
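Definitions 3.1 and 3.2 can be sketched directly in numpy. The code below is an illustrative implementation (with $\sigma = $ ReLU, overlapping circular patches, and normalisations chosen as plausible assumptions); it shows in particular that the CNN output is invariant under cyclic shifts of the input, while the LCN output is not.

```python
import numpy as np

def patches(x, s):
    # All circular s-dimensional patches x_i = (x_i, ..., x_{i+s-1}).
    d = len(x)
    idx = (np.arange(d)[:, None] + np.arange(s)[None, :]) % d
    return x[idx]                              # shape (|P|, s) with |P| = d

def one_hidden_cnn(x, w, a, s):
    # Definition 3.1 (sketch): (1/sqrt(H)) sum_h a_h (1/|P|) sum_i relu(w_h . x_i).
    pre = patches(x, s) @ w.T                  # (|P|, H)
    return np.maximum(pre, 0.0).mean(0) @ a / np.sqrt(len(a))

def one_hidden_lcn(x, w, a, s):
    # Definition 3.2 (sketch): one independent filter w_{h,i} per patch i.
    H, n_patches = a.shape
    pre = np.einsum('hip,ip->hi', w, patches(x, s))    # w: (H, |P|, s)
    return np.sum(np.maximum(pre, 0.0) * a) / np.sqrt(H * n_patches)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w_cn, a_cn = rng.standard_normal((5, 2)), rng.standard_normal(5)
w_lc, a_lc = rng.standard_normal((5, 8, 2)), rng.standard_normal((5, 8))
f_cn = one_hidden_cnn(x, w_cn, a_cn, 2)        # invariant under np.roll(x, 1)
f_lc = one_hidden_lcn(x, w_lc, a_lc, 2)        # generally not invariant
```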
Notice that the definition above reduces to that of a fully-connected network when the filter size is set to the input dimension, $s = d$, and $\mathcal{P} = \{1\}$. With the target functions taking one of the two forms in Eq. (2), our framework contains the case where the observations are generated by neural networks such as (3.1) and (3.2). Let us now introduce the neural tangent kernels of such architectures.
Definition 3.3 (Neural Tangent Kernel). Given a neural network function $f(x;\theta)$, where $\theta = (\theta_1, \dots, \theta_N)$ denotes the complete set of parameters and $N$ the total number of parameters, the Neural Tangent Kernel (NTK) is defined as [12]
$$\Theta(x, y) = \sum_{n=1}^{N} \partial_{\theta_n} f(x;\theta)\, \partial_{\theta_n} f(y;\theta),$$
where $\partial_{\theta_n}$ denotes partial derivation with respect to the $n$-th parameter $\theta_n$.
For one-hidden-layer networks with random, $O(1)$-variance Gaussian initialisation of all the weights, and normalisation by $\sqrt{H}$ as in (3.1) and (3.2), the NTK converges to a deterministic limit $\Theta(x, y)$ as $N \propto H \to \infty$ [12]. Furthermore, training $f(x;\theta) - f(x;\theta_0)$, with $\theta_0$ denoting the network parameters at initialisation, under gradient descent on the mean squared error is equivalent to performing ridgeless regression with the kernel $\Theta(x, y)$ [12]. The following lemmas relate the NTK of convolutional and local architectures acting on $d$-dimensional inputs to that of a fully-connected architecture acting on $s$-dimensional inputs. Both lemmas are proved in Appendix B.
Lemma 3.1. Call $\Theta^{FC}$ the NTK of a fully-connected network function acting on $s$-dimensional inputs and $\Theta^{CN}$ the NTK of a convolutional network function (3.1) with filter size $s$ acting on $d$-dimensional inputs. Then
$$\Theta^{CN}(x, y) = \frac{1}{|\mathcal{P}|^2} \sum_{i,j\in\mathcal{P}} \Theta^{FC}(x_i, y_j). \tag{12}$$
As the functions in Eq. (2), $\Theta^{CN}$ is written as a combination of lower-dimensional constituent kernels $\Theta^{FC}$ acting on patches, and the dimensionality of the constituent kernel coincides with the filter size of the corresponding network. This observation extends to local kernels, via
Lemma 3.2. Call $\Theta^{LC}$ the NTK of a locally-connected network function (3.2) with filter size $s$ acting on $d$-dimensional inputs. Then
$$\Theta^{LC}(x, y) = \frac{1}{|\mathcal{P}|} \sum_{i\in\mathcal{P}} \Theta^{FC}(x_i, y_i). \tag{13}$$
Following the general structure of Eq. (12) and Eq. (13), we introduce convolutional ($K^{CN}$) and local ($K^{LC}$) student and teacher kernels, defined as sums of lower-dimensional constituent kernels $C$,
$$K^{CN}(x, y) = \frac{1}{|\mathcal{P}|^2} \sum_{i,j\in\mathcal{P}} C(x_i, y_j), \qquad K^{LC}(x, y) = \frac{1}{|\mathcal{P}|} \sum_{i\in\mathcal{P}} C(x_i, y_i). \tag{14}$$
The kernels in Eq. (14) are characterised by the dimensionality of the constituent kernel $C$, or filter size $s$ (for the student, or $t$ for the teacher), and by the nonanalytic behaviour of $C$ when the two arguments approach each other, i.e. $C(x_i, y_j) \sim \lVert x_i - y_j \rVert^{\alpha_s}$ (for the student, or $\lVert x_i - y_j \rVert^{\alpha_t}$ for the teacher) plus analytic contributions, with $\alpha_{s/t} \neq 2m$ for $m \in \mathbb{N}$. Using the kernels in Eq. (14) as covariances allows us to generate random target functions with the desired degree of locality $t$ (as in Eq. (2)), which can also be invariant under shifts of the patches. Having a student kernel as in Eq. (14) results in an estimator $\hat{f}$ also having the form displayed in Eq. (2), with a filter size possibly different from that of the target function. The $\alpha$'s control the smoothness of these functions: if $\alpha > 2n$ with $n \in \mathbb{N}$, then the functions are at least $n$ times differentiable in the mean-square sense.
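The structure of Eq. (14) can be sketched as follows, with nonoverlapping patches and a Laplacian constituent chosen for illustration: the convolutional kernel compares all pairs of patches, while the local kernel compares matching patches only.

```python
import numpy as np

def constituent(xp, yp):
    # Laplacian constituent C(x_i, y_j) = exp(-||x_i - y_j||): cusp at the
    # origin, i.e. smoothness exponent alpha = 1.
    return np.exp(-np.linalg.norm(xp - yp))

def k_conv(x, y, s):
    # Eq. (14), convolutional: average of C over all pairs of patches.
    starts = range(0, len(x), s)               # nonoverlapping patches
    return np.mean([[constituent(x[i:i+s], y[j:j+s]) for j in starts]
                    for i in starts])

def k_local(x, y, s):
    # Eq. (14), local: average of C over matching patches only.
    return np.mean([constituent(x[i:i+s], y[i:i+s])
                    for i in range(0, len(x), s)])

rng = np.random.default_rng(1)
x, y = rng.standard_normal(6), rng.standard_normal(6)
kc, kl = k_conv(x, y, 2), k_local(x, y, 2)
```

Shifting one argument by a whole patch permutes its patches, so `k_conv` is unchanged while `k_local` generally is not: this is the shift-invariance prior of the convolutional kernel.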
A notable example of such constituent kernels is the NTK of ReLU networks $\Theta^{FC}$, which presents a cusp at the origin corresponding to $\alpha_s = 1$ [32]. In addition, in the $H \to \infty$ limit, a network initialised with random weights converges to a Gaussian process [33,34,35]. For networks with ReLU activations, the covariance kernel of such a process has nonanalytic behaviour with $\alpha_t = 3$ [36].

Mercer's decomposition of local and convolutional kernels
We now turn to describing how the eigendecomposition of the constituent kernel $C$ induces an eigendecomposition of convolutional and local kernels. We work under the following assumptions:
i) the constituent kernel $C(x, y)$ on $\mathbb{R}^s\times\mathbb{R}^s$ admits the Mercer's decomposition
$$C(x, y) = \sum_{\rho\geq 1} \lambda_\rho\, \phi_\rho(x)\, \phi_\rho(y), \tag{15}$$
with (ordered) eigenvalues $\lambda_\rho$ and eigenfunctions $\phi_\rho$ such that, with $p^{(s)}(d^sx)$ denoting the $s$-dimensional patch measure, $\phi_1(x) = 1$ for all $x$ and $\int p^{(s)}(d^sx)\, \phi_\rho(x) = 0$ for all $\rho > 1$;
ii) convolutional and local kernels from Eq. (14) have nonoverlapping patches, i.e. $d$ is an integer multiple of $s$ and $\mathcal{P} = \{1 + n\times s \,|\, n = 1, \dots, d/s\}$, with $|\mathcal{P}| = d/s$;
iii) the $s$-dimensional marginals on patches of the $d$-dimensional input measure $p^{(d)}(d^dx)$ are all identical and equal to $p^{(s)}(d^sx)$.
We stress here that the requirement of nonoverlapping patches in assumption ii) can be relaxed at the price of further assumptions, namely $C(x, y) = C(x - y)$ and data distributed uniformly on the torus, so that $C$ is diagonalised in Fourier space. The resulting eigendecompositions are qualitatively similar to those described in this section (details in Appendix C). Let us also remark that assumptions i) and iii), together with all the assumptions on the data distribution that might follow, are technical in nature and required only to carry out the Mercer's decomposition analytically. We believe that the main results of this paper hold under much more general conditions, namely the support of the distribution being truly $d$-dimensional (such that the distance between neighbouring points in a collection of $P$ data points scales as $P^{-1/d}$) and the distribution itself decaying rapidly away from the mean or having compact support. Our experiments, discussed in Section 5, support this hypothesis.
Lemma 3.3 (Spectra of convolutional kernels). Let $K^{CN}$ be a convolutional kernel defined as in Eq. (14a), with a constituent kernel $C$ satisfying assumptions i), ii) and iii) above. Then $K^{CN}$ admits the following Mercer's decomposition, with eigenvalues and eigenfunctions
$$\Lambda_1 = \lambda_1,\ \Phi_1(x) = 1; \qquad \Lambda_\rho = \frac{\lambda_\rho}{|\mathcal{P}|},\ \Phi_\rho(x) = \frac{1}{\sqrt{|\mathcal{P}|}} \sum_{i\in\mathcal{P}} \phi_\rho(x_i) \quad \text{for } \rho > 1. \tag{16}$$
Lemma 3.4 (Spectra of local kernels). Let $K^{LC}$ be a local kernel defined as in Eq. (14b), with a constituent kernel $C$ satisfying assumptions i), ii) and iii) above. Then $K^{LC}$ admits the following Mercer's decomposition, with eigenvalues and eigenfunctions ($\forall i \in \mathcal{P}$)
$$\Lambda_1 = \lambda_1,\ \Phi_1(x) = 1; \qquad \Lambda_{\rho,i} = \frac{\lambda_\rho}{|\mathcal{P}|},\ \Phi_{\rho,i}(x) = \phi_\rho(x_i) \quad \text{for } \rho > 1. \tag{17}$$
Under assumptions i), ii) and iii) above, Lemmas 3.3 and 3.4 follow from the definitions of convolutional and local kernels and the eigendecompositions of the constituents (see Appendix C for a proof of the lemmas and a generalisation to kernels with overlapping patches). In the next section, we explore the consequences of these results for the asymptotics of learning curves.
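A minimal numerical check of the local-kernel spectrum is possible, under the assumption (used throughout this sketch) that the nontrivial eigenvalues of the local kernel are the constituent eigenvalues rescaled by the number of patches, $\lambda_\rho/|\mathcal{P}|$, with multiplicity $|\mathcal{P}|$. We take $d = 2$, $s = 1$ (so $|\mathcal{P}| = 2$), a periodic constituent, and a uniform grid standing in for the uniform measure on the torus:

```python
import numpy as np

n = 16
grid = np.arange(n) / n                        # uniform grid on the circle [0, 1)

def C(x, y):
    # A periodic, translation-invariant, positive-definite constituent (assumed form).
    return np.exp(np.cos(2 * np.pi * (x - y)))

# Operator spectrum of the constituent: Gram matrix over the grid, divided by n.
lam = np.sort(np.linalg.eigvalsh(C(grid[:, None], grid[None, :])) / n)[::-1]

# Local kernel on the 2-torus with 1-dimensional patches (|P| = 2):
# K_LC((x1, x2), (y1, y2)) = (C(x1, y1) + C(x2, y2)) / 2, on the product grid.
X1, X2 = [g.ravel() for g in np.meshgrid(grid, grid)]
K = 0.5 * (C(X1[:, None], X1[None, :]) + C(X2[:, None], X2[None, :]))
Lam = np.sort(np.linalg.eigvalsh(K) / n**2)[::-1]
# Expected: Lam[0] = lam[0] (constant mode), then each nontrivial constituent
# eigenvalue appears as lam[rho] / 2, with multiplicity 2.
```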

Asymptotic learning curves for ridgeless regression
In what follows, we consider explicitly translationally-invariant constituent kernels, $C(x, y) = C(x - y)$, and input data distributed uniformly on the torus, so that all lower-dimensional marginals are also uniform on lower-dimensional tori. Under these conditions, all results of Section 3 can be extended to kernels with overlapping patches ($\mathcal{P} = \{1, \dots, d\}$), so that the main results of this paper apply to nonoverlapping- as well as overlapping-patches kernels. Furthermore, the Mercer's decomposition Eq. (15) can be written in Fourier space [37], with $s$-dimensional plane waves $\phi^{(s)}_k(x) = e^{ik\cdot x}$ as eigenfunctions and the eigenvalues coinciding with the Fourier transform of $C$. For kernels with filter size $s$ (or $t$) and positive smoothness exponent $\alpha_s$ (or $\alpha_t$), the eigenvalues decay with a power $-(s + \alpha_s)$ (or $-(t + \alpha_t)$) of the modulus of the wavevector $k = \sqrt{k\cdot k}$ [38]. In this setting, we obtain our main result:
Theorem 4.1. Let $K_T$ be a $d$-dimensional convolutional kernel with a translationally-invariant $t$-dimensional constituent and leading nonanalyticity at the origin controlled by the exponent $\alpha_t > 0$. Let $K_S$ be a $d$-dimensional convolutional or local student kernel with a translationally-invariant $s$-dimensional constituent, and with a nonanalyticity at the origin controlled by the exponent $\alpha_s > 0$. Assume, in addition, that if the kernels have overlapping patches then $s \geq t$, whereas if the kernels have nonoverlapping patches $s$ is an integer multiple of $t$; and that data are uniformly distributed on a $d$-dimensional torus. Then, the following asymptotic equivalence holds in the limit $P \to \infty$,
$$B(P) = \sum_{\rho > P} \mathbb{E}[|c_\rho|^2] \sim P^{-\alpha_t/s}. \tag{19}$$
Theorem 4.1, together with Eq. (8) and the additional assumption $\alpha_t \leq 2(\alpha_s + s)$, yields the following expression for the learning curve asymptotics,
$$\epsilon(P) \sim P^{-\beta}, \qquad \beta = \frac{\alpha_t}{s}. \tag{20}$$
As $\beta$ is independent of the embedding dimension $d$, we conclude that the curse of dimensionality is beaten when a convolutional target is learnt with a convolutional or local kernel. In fact, Eq.
(20) indicates that there is no asymptotic advantage in using a convolutional rather than a local student when learning a convolutional task, confirming the picture that locality, not weight sharing, is the main source of the convolutional architecture's performance [6]. In Appendix D we show that the generalisation error of a local student learning a convolutional teacher decays as
$$\epsilon^{LC}(P) \sim \epsilon^{CN}\!\left(P/|\mathcal{P}|\right). \tag{21}$$
Eq. (21) implies that including weight sharing only amounts to a rescaling of $P$ by a factor $|\mathcal{P}|$ (the size of the translation group over patches), recovering the result obtained in [28]. Intuitively, a local student will need $|\mathcal{P}|$ times more points than a convolutional student to learn the target with comparable accuracy, since it has to learn the same local function in all the $|\mathcal{P}|$ possible locations. The predictions in Eq. (20) and Eq. (21) are confirmed empirically, as discussed in Section 5 and Appendix G. Let us mention in particular that, although our predictions are valid only asymptotically, they hold already in the range $P \sim 10^2$-$10^3$, consistently with the number of training points typically used in applications.
Theorem 4.1 is proven in Appendix D and extended to the case of a local teacher and local student in Appendix E. Here we sketch the proof for the nonoverlapping case, which begins with the calculation of the variance $\mathbb{E}[|c_k|^2]$ of the coefficients of the target function in the student basis, indexed by the $s$-dimensional wavevectors $k$. If the filter sizes of teacher and student coincide, $s = t$, teacher and student have the same eigenfunctions. Thus, using the eigenvalue equation Eq. (6) of the teacher yields $\mathbb{E}[|c_k|^2] \sim k^{-(\alpha_t + t)} = k^{-(\alpha_t + s)}$. After ranking eigenvalues by the modulus $k$, with multiplicity $\sim k^{s-1}$ from all the wavevectors having the same modulus, one has
$$B(P) \sim \int_{k(P)}^{\infty} dk\, k^{s-1}\, k^{-(\alpha_t + s)} \sim P^{-\alpha_t/s}, \qquad k(P) \sim P^{1/s}. \tag{23}$$
When the filter size of the teacher $t$ is lowered, some of the coefficients $\mathbb{E}[|c_k|^2]$ vanish. As the target function becomes a composition of $t$-dimensional constituents, the only non-zero coefficients are found for $k$'s which lie in some $t$-dimensional subspaces of the $s$-dimensional Fourier space. These subspaces correspond to the $k$'s having at most a patch of $t$ consecutive non-vanishing components. In other words, $\mathbb{E}[|c_k|^2]$ is finite only if $k$ is effectively $t$-dimensional, and the integral on the right-hand side of Eq. (23) becomes $t$-dimensional, thus
$$B(P) \sim \int_{k(P)}^{\infty} dk\, k^{t-1}\, k^{-(\alpha_t + t)} \sim P^{-\alpha_t/s}. \tag{24}$$
If the teacher patches are not contained in the student ones, the target cannot be represented as a combination of student eigenfunctions, hence the error asymptotes to a finite value when $P \to \infty$.
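The counting argument above can be reproduced numerically. In the sketch below ($s = 2$, $t = 1$, $\alpha_t = 1$, and a continuum approximation for the number of modes below a given modulus, which is an assumption of this illustration), the teacher power sits on the effectively 1-dimensional wavevectors of $\mathbb{Z}^2$, and its tail past the first $P$ student modes decays with the predicted exponent $\alpha_t/s = 1/2$:

```python
import numpy as np

s, t, alpha_t = 2, 1, 1.0

def B(P, kmax=10**6):
    # Rank P corresponds to modulus K with ~ pi K^2 wavevectors below it
    # (continuum approximation for the mode count of the 2-d student).
    K = int(np.ceil(np.sqrt(P / np.pi)))
    k = np.arange(K + 1, kmax)
    # Teacher power: 4 effectively 1-dimensional wavevectors (+-k, 0), (0, +-k)
    # per modulus k, each carrying k^-(alpha_t + t).
    return 4 * np.sum(k ** (-(alpha_t + t)))

Ps = np.array([10**3, 10**4, 10**5, 10**6])
slope = np.polyfit(np.log(Ps), np.log([B(P) for P in Ps]), 1)[0]
# the log-log slope approaches -alpha_t/s = -0.5, independently of d
```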

Empirical learning curves for ridgeless regression
This section investigates numerically the asymptotic behaviour of the learning curves for our teacher-student framework. We consider different combinations of convolutional and local teachers and students with overlapping patches and Laplacian constituent kernels, i.e. $C(x_i, y_j) = e^{-\lVert x_i - y_j \rVert}$, corresponding to $\alpha_s = \alpha_t = 1$. In order to test the robustness of our results to the data distribution, data are generated uniformly in the hypercube $[0, 1]^d$ (results in Fig. 1) or on a $d$-hypersphere (results in Appendix G). Fig. 1 shows learning curves for both convolutional (left panels) and local (right panels) students learning a convolutional target function. The results in the case of a local teacher are presented in Appendix G, and display no qualitative differences.
In the following, we always refer to Fig. 1. Panels A and B show that, with $\alpha_t = \alpha_s = 1$, our prediction $\beta = 1/s$ holds independently of the embedding dimension $d$. Furthermore, notice that, fixing the dimension $d$ and the teacher filter size $t$, the generalisation errors of a convolutional and a local student with the same filter size differ only by a multiplicative constant independent of $P$. Indeed, the shift-invariant nature of the convolutional student only results in a pre-asymptotic correction to our estimate of the generalisation error $B(P)$. In Appendix G, we check that this multiplicative constant corresponds to rescaling $P$ by the number of patches, as predicted in Section 4. Panels C and D show learning curves for several values of $s$ and fixed $t$. The curse of dimensionality is recovered when the size of the student filters coincides with the input dimension, both for local and convolutional students. Finally, panels E and F show learning curves for fixed $t$ and $s$ smaller than, equal to, or larger than $t$. We stress that, when $s < t$, the student kernel cannot reproduce the target function, hence the error does not decrease as $P$ increases. Further details on the experiments are provided in Appendix G, together with learning curves for data distributed uniformly on the unit sphere $S^{d-1}$ and for regression with the actual analytical and empirical NTKs of one-hidden-layer convolutional networks. It is worth noticing that the experiments are always in excellent agreement with our predictions, despite using data distributions that lie outside the hypotheses of Theorem 4.1. Indeed, for regression with the actual NTK even the assumption of translationally-invariant constituents is violated. Moreover, we report the learning curves of local kernels on the CIFAR-10 dataset, showing that smaller filter sizes correspond to faster decays even for real and anisotropic data distributions, in agreement with the picture emerging from our synthetic model.
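A stripped-down version of these experiments fits in a few lines; the sizes, seed and jitter regularisations below are illustrative choices, far smaller than in Fig. 1, and the teacher filter size equals the student's ($t = s = 2$, so the predicted exponent is $\alpha_t/s = 1/2$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, s = 4, 2, 2                              # toy input dimension and filter sizes

def conv_kernel(X, Y, f):
    # Overlapping-patch convolutional kernel with Laplacian constituent:
    # K(x, y) = mean_{i,j} exp(-||x_i - y_j||), circular patches of size f.
    Ks = []
    for i in range(d):
        for j in range(d):
            xi = X[:, (np.arange(f) + i) % d]
            yj = Y[:, (np.arange(f) + j) % d]
            d2 = ((xi[:, None, :] - yj[None, :, :]) ** 2).sum(-1)
            Ks.append(np.exp(-np.sqrt(d2)))
    return np.mean(Ks, axis=0)

def run(P, n_test=256):
    X = rng.uniform(size=(P + n_test, d))
    K_T = conv_kernel(X, X, t)                 # teacher covariance
    L = np.linalg.cholesky(K_T + 1e-8 * np.eye(len(X)))
    f_star = L @ rng.standard_normal(len(X))   # Gaussian random target
    K_S = conv_kernel(X[:P], X[:P], s)         # student Gram matrix
    alpha = np.linalg.solve(K_S + 1e-8 * np.eye(P), f_star[:P])
    pred = conv_kernel(X[P:], X[:P], s) @ alpha
    return np.mean((pred - f_star[P:]) ** 2)

errs = [np.mean([run(P) for _ in range(2)]) for P in (64, 128, 256, 512)]
```

Averaging over more target realisations and fitting the log-log slope of `errs` against $P$ should recover $\beta \approx 1/2$ for these filter sizes.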

Asymptotics of learning curves with decreasing ridge
We now prove an upper bound for the exponent $\beta$ implying that the curse of dimensionality is beaten by a local or convolutional kernel learning a convolutional target (as in Eq. (2)), using the framework developed in [21] and a natural universality assumption on the kernel eigenfunctions. It is worth noticing that this framework does not require the target function to be generated by a teacher kernel. Proofs are presented in Appendix F. Let $D(\Lambda)$ denote the density of eigenvalues of the student kernel, $D(\Lambda) = \sum_\rho \delta(\Lambda - \Lambda_\rho)$, with $\delta(x)$ denoting the Dirac delta function. For a random target function whose coefficients $c_\rho$ in the kernel eigenbasis have variance $\mathbb{E}[|c_\rho|^2]$, one can define the following reduced density (with respect to the teacher):
$$D_T(\Lambda) = \sum_{\rho\,:\,\mathbb{E}[|c_\rho|^2] > 0} \delta(\Lambda - \Lambda_\rho). \tag{25}$$
$D_T(\Lambda)$ counts the eigenvalues for which the target has a non-zero variance, such that
$$\int d\Lambda\, D_T(\Lambda)\, c^2(\Lambda) = \sum_\rho \mathbb{E}[|c_\rho|^2], \tag{26}$$
where the function $c(\Lambda)$ is defined by $c^2(\Lambda_\rho) = \mathbb{E}[|c_\rho|^2]$ for all $\rho$ such that $\mathbb{E}[|c_\rho|^2] > 0$. The following theorem then follows from the results of [21].
Theorem 6.1. Let us consider a positive-definite kernel $K$ with eigenvalues $\Lambda_\rho$, $\sum_\rho \Lambda_\rho < \infty$, and eigenfunctions $\Phi_\rho$, learning a (random) target function $f^*$ in kernel ridge regression (Eq. (3)) with ridge $\lambda$ from $P$ observations $f^*(x^\mu)$, with $x^\mu \in \mathbb{R}^d$ drawn from a certain probability distribution. Let us denote by $D_T(\Lambda)$ the reduced density of kernel eigenvalues with respect to the target and by $\epsilon(\lambda, P)$ the generalisation error, and also assume that:
i) for any $P$-tuple of indices $\rho_1, \dots, \rho_P$, the vector $(\Phi_{\rho_1}(x^1), \dots, \Phi_{\rho_P}(x^P))$ is a Gaussian random vector;
ii) the target function can be written in the kernel eigenbasis with coefficients $c_\rho$, and $D_T(\Lambda) \sim \Lambda^{-r}$, $c^2(\Lambda) \sim \Lambda^{q}$ asymptotically for small $\Lambda$, with $r > 0$ and $r < q < r + 2$.
Then the following equivalence holds in the joint $P \to \infty$ and $\lambda \to 0$ limit with $1/(\lambda\sqrt{P}) \to 0$:
$$\epsilon(\lambda, P) \sim \int_0^{\lambda} d\Lambda\, D_T(\Lambda)\, c^2(\Lambda). \tag{27}$$
Note that assumption i) of the theorem on the Gaussianity of the eigenbasis does not hold in our setup, where the $\Phi_\rho$'s are plane waves. However, the random variables $\Phi_\rho(x^\mu)$ have a probability density with compact support. It is thus natural to assume that a Gaussian universality assumption holds, i.e. that Theorem 6.1 applies to our problem. With this assumption, we obtain the following
Corollary 6.1.1. Performing kernel ridge regression in a teacher-student scenario with smoothness exponents $\alpha_t$ (teacher) and $\alpha_s$ (student), with ridge $\lambda \sim P^{-\gamma}$ and $0 < \gamma < 1/2$, under the joint hypotheses of Theorem 4.1 and Theorem 6.1, the exponent governing the asymptotic scaling of the generalisation error with $P$ is given by
$$\beta = \gamma\, \frac{\alpha_t}{\alpha_s + s}, \tag{28}$$
which does not vanish in the limit $d \to \infty$. Furthermore, Eq. (28) depends on $s$ and not on $t$, as the prediction of Eq. (20).

Conclusions and future work
Our work shows that, even in large dimension $d$, a function can be learnt efficiently if it can be expressed as a sum of constituent functions, each depending on a smaller number $t$ of variables, by performing regression with a kernel that entails such a compositional structure with $s$-dimensional constituents. The learning curve exponent is then independent of $d$ and governed by $s$ if $s \geq t$: it is optimal for $s = t$ and vanishes if $s < t$.
In the context of image classification, this result relates to the "Bag of Words" viewpoint. Consider for example two-dimensional images consisting of $M$ features of $t$ adjacent pixels each, where different classes correspond to distinct subsets of (possibly shared) features. If features can be located anywhere, then the data lie on a $2M$-dimensional manifold. On the one hand, we expect a one-hidden-layer convolutional network with filter size $s \geq t$ to learn well, with a learning curve exponent governed by $s$ and independent of $M$. On the other hand, a fully-connected network would suffer from the curse of dimensionality for large $M$.
Our work does not consider that the compositional structure of real data is hierarchical, with large features that consist of smaller sub-features. It is intuitively clear that depth and locality taken together are well-suited for such data structure [8,6]. Extending the present teacher-student framework to this case would offer valuable quantitative insights into the question of how many data are required to learn such tasks.

A Spectral bias in kernel regression
In this appendix we provide additional details about the derivation of Eq. (8) within the framework of [17,18]. Let us begin by recalling the definition of the kernel ridge regression estimator $\hat{f}$ of a target function $f^*$,
$$\hat{f} = \underset{f\in\mathcal{H}}{\operatorname{argmin}} \left[ \sum_{\mu=1}^{P} \left( f(x^\mu) - f^*(x^\mu) \right)^2 + \lambda \lVert f \rVert_{\mathcal{H}}^2 \right], \tag{S1}$$
where $\mathcal{H}$ denotes the Reproducing Kernel Hilbert Space (RKHS) of the kernel $K(x, y)$. After introducing the Mercer's decomposition of the kernel, the RKHS can be characterised as a subset of the space of functions lying in the span of the kernel eigenbasis,
$$\mathcal{H} = \left\{ f = \sum_\rho a_\rho \phi_\rho \;\middle|\; \lVert f \rVert_{\mathcal{H}} < \infty \right\}. \tag{S2}$$
In other words, the RKHS contains functions having a finite norm $\lVert f \rVert_{\mathcal{H}} = \sqrt{\langle f, f \rangle_{\mathcal{H}}}$ with respect to the following inner product,
$$\left\langle \sum_\rho a_\rho \phi_\rho,\; \sum_\rho b_\rho \phi_\rho \right\rangle_{\mathcal{H}} = \sum_\rho \frac{a_\rho b_\rho}{\lambda_\rho}. \tag{S3}$$
Given any target function $f^*$ lying in the span of the kernel eigenbasis, the mean squared generalisation error of the kernel ridge regression estimator reads
$$\epsilon = \sum_\rho |a_\rho - c_\rho|^2, \tag{S4}$$
with $c_\rho$ denoting the $\rho$-th coefficient of the target $f^*$ and $a_\rho$ that of the estimator $\hat{f}$, which depends on the ridge $\lambda$ and on the training set $\{x^\mu\}_{\mu=1,\dots,P}$. Notice that, as $\hat{f}$ belongs to $\mathcal{H}$ by definition, $\sum_\rho |a_\rho|^2/\lambda_\rho < \infty$, whereas the $c_\rho$'s are free to take any value. The quantity of interest is the average of the mean squared error over realisations of the training set,
$$\epsilon(\lambda, P) = \mathbb{E}_{\{x^\mu\}} \left[ \sum_\rho |a_\rho - c_\rho|^2 \right]. \tag{S5}$$
The authors of [17,18] found a heuristic expression for this average. Such expression, based on the replica method of statistical physics, reads
$$\epsilon(\lambda, P) = \partial_\lambda \left( \frac{\kappa_\lambda(P)}{P} \right) \sum_\rho \mathbb{E}[|c_\rho|^2]\, \frac{\left( \kappa_\lambda(P)/P \right)^2}{\left( \lambda_\rho + \kappa_\lambda(P)/P \right)^2}, \tag{S6}$$
where $\kappa_\lambda(P)$ satisfies the self-consistent equation
$$\frac{\kappa_\lambda(P)}{P} = \lambda + \frac{1}{P} \sum_\rho \frac{\lambda_\rho\, \kappa_\lambda(P)/P}{\lambda_\rho + \kappa_\lambda(P)/P}. \tag{S7}$$
In short, the replica method works as follows [39]: first one defines an energy function $E(f)$ as the argument of the minimum in Eq. (S1), then attributes to the predictor $f$ a Boltzmann-like probability distribution $P(f) = Z^{-1} e^{-\beta E(f)}$, with $Z$ a normalisation constant and $\beta > 0$. As $\beta \to \infty$, the probability distribution $P(f)$ concentrates around the solution of the minimisation problem of Eq. (S1), i.e. the predictor of kernel regression. Hence, one can replace $f$ in the right-hand side of Eq. (S5) with an average over $P(f)$ at finite $\beta$, then perform the limit $\beta \to \infty$ after the calculation so as to recover the generalisation error.
The simplification stems from the fact that, once $f$ is replaced with its eigendecomposition, the energy function $E(f)$ becomes a quadratic function of the coefficients. Then, under the assumption that the data distribution enters only via the first and second moments of the eigenfunctions $\phi_\rho(x)$ w.r.t. $x$, all the averages in Eq. (S5) reduce to Gaussian integrals.
Mathematically, $\kappa_\lambda(P)/P$ is related to the Stieltjes transform [40] of the Gram matrix $K_P/P$ in the large-$P$ limit. Intuitively, it plays the role of a threshold: the modal contributions to the error tend to $0$ for $\rho$ such that $\lambda_\rho \gg \kappa_\lambda(P)/P$, and to $\mathbb{E}[|c_\rho|^2]$ for $\rho$ such that $\lambda_\rho \ll \kappa_\lambda(P)/P$. This is equivalent to saying that the predictor $\hat f(x)$ captures only the eigenmodes having eigenvalue larger than $\kappa_\lambda(P)/P$ (see also [41,21]).
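This thresholding behaviour can be checked directly on Eqs. (S6)-(S7). The sketch below (ours, not part of the original derivation) solves the ridgeless self-consistent equation for an assumed power-law spectrum $\lambda_\rho = \rho^{-2}$ and inspects the modal suppression factor $1/(1+\lambda_\rho P/\kappa_0(P))^2$:

```python
import numpy as np

# Ridgeless self-consistency, Eq. (S7) at lambda = 0: dividing both sides by
# kappa_0(P)/P gives  (1/P) * sum_rho lambda_rho / (t + lambda_rho) = 1,
# with t = kappa_0(P)/P. The left-hand side decreases with t, so bisection works.
rho = np.arange(1, 10**6 + 1, dtype=float)
lam = rho ** -2.0  # assumed power-law spectrum, a = 2

def kappa0_over_P(P):
    lo, hi = 1e-15, 1.0
    for _ in range(80):
        t = 0.5 * (lo + hi)
        if (lam / (t + lam)).sum() / P > 1.0:
            lo = t
        else:
            hi = t
    return 0.5 * (lo + hi)

P = 1000
t = kappa0_over_P(P)
# modal contribution to the error relative to E|c_rho|^2, cf. Eq. (S6)
suppression = 1.0 / (1.0 + lam / t) ** 2
# modes with lambda_rho >> t are learned (contribution ~ 0), modes with
# lambda_rho << t are untouched; the crossover sits at rho_c = t^{-1/2} ~ P
rho_c = t ** -0.5
```

For $a = 2$ the crossover mode $\rho_c$ indeed comes out of order $P$, in agreement with the thresholding picture.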
This intuitive picture can be exploited to extract the learning curve exponent $\beta$ from the asymptotic behaviour of Eq. (S6) and Eq. (S7) in the ridgeless limit $\lambda \to 0^+$. In the following, we assume that both the kernel and the target function have a power-law spectrum, in particular
$$\lambda_\rho \simeq A\,\rho^{-a}, \qquad \mathbb{E}\left[|c_\rho|^2\right] \simeq B\,\rho^{-b}, \qquad a,\, b > 1. \qquad \mathrm{(S8)}$$
First, we approximate the sum over modes in Eq. (S7) with an integral using the Euler-Maclaurin formula. Then we substitute the eigenvalues inside the integral with their asymptotic limit, $\lambda_\rho = A\rho^{-a}$. Since $\kappa_0(P)/P \to 0$ as $P \to \infty$, both these operations result in an error which is asymptotically independent of $P$. Namely,
$$\frac{\kappa_0(P)}{P} \simeq \frac{1}{P} \int_1^{+\infty} \mathrm{d}\rho\, \frac{A\rho^{-a}}{1 + A\rho^{-a}\, P/\kappa_0(P)} = \frac{\kappa_0(P)}{P}\, \frac{1}{aP} \left( \frac{AP}{\kappa_0(P)} \right)^{1/a} \int_{\kappa_0(P)/(AP)}^{+\infty} \mathrm{d}\sigma\, \frac{\sigma^{1/a - 1}}{1 + \sigma},$$
where, in the second equality, we changed the integration variable from $\rho$ to $\sigma = \kappa_0(P)\rho^a/(AP)$. Since the integral in $\sigma$ is finite and independent of $P$ in the limit $P \to \infty$, we obtain $\kappa_0(P)/P = O(P^{-a})$. Similarly, we find that the mode-independent prefactor satisfies $\partial_\lambda \left( \kappa_\lambda(P)/P \right)|_{\lambda=0} = O(1)$. As a result we are left with, modulo some $P$-independent prefactors,
$$\bar\epsilon(P) \sim \sum_\rho \frac{\mathbb{E}\left[|c_\rho|^2\right]}{\left( 1 + \lambda_\rho\, P/\kappa_0(P) \right)^2}. \qquad \mathrm{(S9)}$$
Following the intuitive argument about the thresholding role of $\kappa_0(P)/P \sim P^{-a}$, it is convenient to split the sum in Eq. (S9) into sectors where $\lambda_\rho \gg \kappa_0(P)/P$ (i.e. $\rho \ll P$), $\lambda_\rho \sim \kappa_0(P)/P$ ($\rho \sim P$) and $\lambda_\rho \ll \kappa_0(P)/P$ ($\rho \gg P$). Finally, Eq. (8) is obtained by noticing that, under our assumptions on the decay of $\mathbb{E}[|c_\rho|^2]$ with $\rho$, the contribution of the sum over $\rho \ll P$ is subleading in $P$, whereas the other two sums can be gathered together into $\sum_{\rho \gtrsim P} \mathbb{E}[|c_\rho|^2]$.
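The two resulting scalings, $\kappa_0(P)/P = O(P^{-a})$ and $\bar\epsilon(P) \sim \sum_{\rho \gtrsim P}\mathbb{E}[|c_\rho|^2] \sim P^{-(b-1)}$, can be verified numerically. A minimal sketch under the power-law assumptions of Eq. (S8), with $a = 2$ and $b = 3/2$ chosen arbitrarily:

```python
import numpy as np

a, b = 2.0, 1.5
rho = np.arange(1, 10**6 + 1, dtype=float)
lam = rho ** -a   # kernel spectrum, Eq. (S8)
c2 = rho ** -b    # target spectrum E[|c_rho|^2], Eq. (S8)

def kappa0_over_P(P):
    # bisection on (1/P) sum_rho lam_rho/(t + lam_rho) = 1 (Eq. (S7), lambda -> 0)
    lo, hi = 1e-15, 1.0
    for _ in range(80):
        t = 0.5 * (lo + hi)
        lo, hi = (t, hi) if (lam / (t + lam)).sum() / P > 1.0 else (lo, t)
    return 0.5 * (lo + hi)

Ps = np.array([100, 200, 400, 800])
ts = np.array([kappa0_over_P(P) for P in Ps])
eps = np.array([(c2 / (1.0 + lam / t) ** 2).sum() for t in ts])  # Eq. (S9)

slope_t = np.polyfit(np.log(Ps), np.log(ts), 1)[0]    # expected ~ -a
slope_e = np.polyfit(np.log(Ps), np.log(eps), 1)[0]   # expected ~ -(b - 1)
```

The fitted slopes approach $-a$ and $-(b-1)$ already at these moderate values of $P$.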

B NTKs of convolutional and locally-connected networks
We begin this section by reviewing the computation of the NTK of a one-hidden-layer fully-connected network [16].

Definition B.1 (one-hidden-layer FCN). A one-hidden-layer fully-connected network with $H$ hidden neurons is defined as follows,
$$f(x) = \frac{1}{\sqrt{H}} \sum_{h=1}^{H} a_h\, \sigma\!\left( w_h \cdot x + b_h \right), \qquad \mathrm{(B.1)}$$
where $x \in \mathbb{R}^d$ is the input, $H$ is the width, $\sigma$ is a nonlinear activation function, and $\{w_h \in \mathbb{R}^d\}_{h=1}^H$, $\{b_h \in \mathbb{R}\}_{h=1}^H$, $\{a_h \in \mathbb{R}\}_{h=1}^H$ are the network's parameters. The dot $\cdot$ denotes the standard Euclidean scalar product.

Inserting (B.1) into Definition 3.3, one obtains
where $\sigma'$ denotes the derivative of $\sigma$ with respect to its argument. If all the parameters are initialised independently from a standard normal distribution, $\Theta_{FCN}(x,y;\theta)$ is a random-feature kernel that in the $H \to \infty$ limit converges to a deterministic kernel $\Theta_{FC}(x,y)$. When $\sigma$ is the ReLU activation function, the expectations can be computed exactly using techniques from the literature on arc-cosine kernels [36],
$$\Theta_{FC}(x,y) = \frac{1}{2\pi} \sqrt{\left(\|x\|^2 + 1\right)\left(\|y\|^2 + 1\right)}\, \big( \sin\varphi + (\pi - \varphi)\cos\varphi \big), \qquad \mathrm{(S14)}$$
with $\varphi$ denoting the angle
$$\varphi = \arccos\!\left( \frac{x \cdot y + 1}{\sqrt{\left(\|x\|^2 + 1\right)\left(\|y\|^2 + 1\right)}} \right).$$
Notice that, as commented in Section 3, for ReLU networks $\Theta_{FC}(x,y)$ displays a cusp at $x = y$.
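The ReLU expectation behind Eq. (S14) is the degree-one arc-cosine formula: for centred jointly-Gaussian $(u, v)$ with standard deviations $\sigma_u, \sigma_v$ and correlation $\cos\varphi$, one has $\mathbb{E}[\mathrm{ReLU}(u)\,\mathrm{ReLU}(v)] = \sigma_u\sigma_v\,(\sin\varphi + (\pi-\varphi)\cos\varphi)/(2\pi)$. A quick Monte Carlo check (our sketch; variances and correlation are arbitrary):

```python
import numpy as np

def relu_arccos(su, sv, corr):
    # closed form for E[relu(u) relu(v)], centred Gaussians with correlation corr
    phi = np.arccos(np.clip(corr, -1.0, 1.0))
    return su * sv / (2 * np.pi) * (np.sin(phi) + (np.pi - phi) * np.cos(phi))

rng = np.random.default_rng(0)
su, sv, corr = 1.3, 0.7, 0.4
n = 4_000_000
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
u = su * z1                                     # std su
v = sv * (corr * z1 + np.sqrt(1 - corr**2) * z2)  # std sv, correlation corr with u
mc = np.mean(np.maximum(u, 0.0) * np.maximum(v, 0.0))
exact = relu_arccos(su, sv, corr)
```

With $4\times10^6$ samples the Monte Carlo estimate matches the closed form to a few parts in a thousand.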

Proof of Lemma 3.1
Proof. Inserting Eq. (9) into Definition 3.3, each term of the resulting sum over patches is the random-feature kernel $\Theta_{FCN}$ obtained in Eq. (S12), acting on $s$-dimensional inputs, i.e. on the patches of $x$ and $y$. Therefore, Eq. (S17) follows. If all the parameters are initialised independently from a standard normal distribution, the $H \to \infty$ limit of Eq. (S17) yields Eq. (12).

Proof of Lemma 3.2
Proof. Inserting Eq. (10) into Definition 3.3, each term of the resulting sum over patches is again the random-feature kernel $\Theta_{FCN}$ obtained in Eq. (S12), acting on $s$-dimensional inputs, i.e. on the patches of $x$ and $y$. If all the parameters are initialised independently from a standard normal distribution, Eq. (13) is recovered in the $H \to \infty$ limit.

C Mercer's decomposition of convolutional and local kernels
In this section we prove the eigendecompositions introduced in Lemma 3.3 and Lemma 3.4, then extend them to kernels with overlapping patches (cf. Section C.1). We define the scalar product in input space between two (complex) functions $f$ and $g$ as
$$\langle f, g \rangle = \int p(\mathrm{d}x)\, \overline{f(x)}\, g(x).$$

Proof of Lemma 3.3

Proof. We start by proving the orthonormality of the eigenfunctions. Write the $d$-dimensional eigenfunctions $\Phi_\rho$ in terms of the $s$-dimensional eigenfunctions $\phi_\rho$ of the constituent kernel, as in Eq. (17), and consider the scalar product $\langle \Phi_\rho, \Phi_\sigma \rangle$ for $\rho, \sigma > 1$. Separating the terms of the sum over patches in which the patch indices $i$ and $j$ coincide from the others, and using that the patches are non-overlapping, the scalar product splits into diagonal terms, proportional to $\int p^{(s)}(\mathrm{d}^s x)\, \overline{\phi_\rho(x)}\,\phi_\sigma(x)$, and off-diagonal terms, proportional to $\int p^{(s)}(\mathrm{d}^s x)\, \phi_\rho(x)$. From the orthonormality of the eigenfunctions $\phi_\rho$, the first integral is non-zero, and equal to one, only when $\rho = \sigma$, while, from assumption i), $\int p^{(s)}(\mathrm{d}^s x)\, \phi_\rho(x) = 0$ for all $\rho > 1$, so that the off-diagonal terms always vanish. Therefore $\langle \Phi_\rho, \Phi_\sigma \rangle = \delta_{\rho\sigma}$; the cases involving $\rho = 1$ or $\sigma = 1$ follow in the same way.

Then, we prove that the eigenfunctions and the eigenvalues defined in Eq. (17) satisfy the kernel eigenproblem. For $\rho = 1$, the claim follows from $\int p^{(s)}(\mathrm{d}^s y)\, C(x,y) = \lambda_1$, which holds by assumption i). For $\rho > 1$, splitting the sum over the patch index $l$ into the term with $l = j$ and the remaining ones, the integrals stemming from the $l \neq j$ terms always vanish by assumption i), and the eigenproblem is satisfied. Finally, the expansion of Eq. (16) follows directly from the definition of $K_{CN}$.

Proof of Lemma 3.4
Proof. We start again by proving the orthonormality of the eigenfunctions. Writing the $d$-dimensional eigenfunctions $\Phi_{\rho,i}$ in terms of the $s$-dimensional eigenfunctions $\phi_\rho$ of the constituent kernel as in Eq. (19), orthonormality follows from the orthonormality of the eigenfunctions $\phi_\rho$ when $i = j$, and from assumption i) when $i \neq j$.

Then, we prove that the eigenfunctions and the eigenvalues defined in Eq. (19) satisfy the kernel eigenproblem. For $\rho = 1$, we use $\int p^{(s)}(\mathrm{d}^s y)\, C(x,y) = \lambda_1$ from assumption i). For $\rho > 1$, splitting the sum over the patch index $j$ into the term with $j = i$ and the remaining ones, the integrals stemming from the $j \neq i$ terms always vanish by assumption i), and the eigenproblem is satisfied. Finally, we prove the expansion of Eq. (18) from the definition of $K_{LC}$.

C.1 Spectra of convolutional kernels with overlapping patches
In this section Lemma 3.3 and Lemma 3.4 are extended to kernels with overlapping patches, having $\mathcal{P} = \{1, \dots, d\}$ and $|\mathcal{P}| = d$. Such an extension requires an additional assumption, stated below:

ii) the constituent kernel $C(x, y)$ is translationally-invariant, isotropic and periodic.

Assumptions i) and ii) above imply that $C(x,y)$ can be diagonalised in Fourier space, i.e. $C(x,y) = \sum_k \lambda_k\, \phi_k(x)\overline{\phi_k(y)}$, with $\phi_k$ denoting the $s$-dimensional plane waves and $k$ the $s$-dimensional wavevector, and the eigenvalues $\lambda_k$ depend only on the modulus of $k$, $k = \sqrt{k \cdot k}$.
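Assumption ii) is what makes Fourier diagonalisation work: a translationally-invariant periodic kernel restricted to a regular grid gives a circulant Gram matrix, whose eigenvectors are discrete plane waves and whose eigenvalues are the DFT of one row. A small self-contained check (grid size and kernel are our arbitrary choices):

```python
import numpy as np

n = 64
grid = np.arange(n) / n
# periodic 'Laplacian-like' constituent C(x, y) = exp(-dist(x, y)),
# with dist the wrap-around (geodesic) distance on the unit circle
diff = np.abs(grid[:, None] - grid[None, :])
C = np.exp(-np.minimum(diff, 1.0 - diff))

# circulant matrix => eigenvalues are the DFT of the first row
lam = np.fft.fft(C[0]).real
k = 3
phi = np.exp(2j * np.pi * k * grid) / np.sqrt(n)  # discrete plane wave
residual = np.linalg.norm(C @ phi - lam[k] * phi)
```

The residual vanishes to machine precision, and the symmetry of the constituent makes all eigenvalues real.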
Let us introduce the following definitions, after recalling that an $s$-dimensional patch $x_i$ of $x$ is a contiguous subsequence of $x$ starting at $x_i$, i.e. $x_i = (x_i, x_{i+1}, \dots, x_{i+s-1})$,
and that inputs are 'wrapped', i.e. we identify x i+nd with x i for all n ∈ Z.
• Two patches $x_i$ and $x_j$ are said to overlap if they share at least one component of the input $x$.
• Let $\mathcal{P}$ denote the set of patch indices associated with a given kernel/architecture. We denote with $\mathcal{P}_i$ the set of indices of the patches which overlap with $x_i$.
• Given two overlapping patches $x_i$ and $x_j$ with $o$-dimensional overlap, the union $x_i \cup x_j$ denotes the $(2s - o)$-dimensional vector obtained by joining them without repeating the shared components.

We also use the following notation for subspaces of the $k$-space $\cong \mathbb{Z}^s$. $F_s$ is the set of all wavevectors $k$ having nonvanishing extremal components $k_1$ and $k_s$. For $u < s$, $F_u$ is formed by first considering only the wavevectors having the last $s - u$ components equal to zero, then asking the resulting $u$-dimensional wavevectors to have nonvanishing extremal components. Practically, $F_u$ contains the wavevectors which can be entirely specified by the first $u$-dimensional patch $k^{(u)}_1 = (k_1, \dots, k_u)$ but not by the first $(u-1)$-dimensional one. Notice that, in order to safely compare $k$'s belonging to different $F$'s, we have introduced a superscript $u$ denoting the dimensionality of the patch.

Lemma C.1 (Spectra of overlapping convolutional kernels). Let $K_{CN}$ be a convolutional kernel defined as in Eq. (14a), with $\mathcal{P} = \{1, \dots, d\}$ and constituent kernel $C$ satisfying assumptions i) and ii) above. Then $K_{CN}$ admits the Mercer's decomposition of Eq. (S42), with eigenfunctions given in Eq. (S43) and eigenvalues given in Eq. (S44).

Proof. We start by proving the orthonormality of the eigenfunctions. In general, by the orthonormality of the $s$-dimensional plane waves $\phi_k(x)$, the scalar product $\langle \Phi_k, \Phi_q \rangle$ reduces to a sum of multidimensional Kronecker deltas $\delta(k, q)$, cf. Eq. (S45). For fixed $i$, the three terms on the RHS correspond to $j$'s such that $x_j$ does not overlap with $x_i$, to $j = i$, and to $j$'s such that $x_j$ overlaps with $x_i$, respectively. In addition, by taking $k \in F_s$ and $q \in F_u$ with $u \leq s$, all the deltas stemming from $q \neq k$ vanish. Thus the $\Phi_k$'s with $k \in F_s$ are orthonormal to each other and orthogonal to all the $\Phi_{q^{(u)}_1}$'s with $u < s$. Similarly, by taking $k \in F_u$ with $u < s$ and $q \in F_v$ with $v \leq u$, orthonormality is proven down to $\Phi_{k^{(1)}_1}$. The zeroth eigenfunction $\Phi_0(x) = 1$ is also orthogonal to all the other eigenfunctions, by Eq. (S45) with $k = 0$, and trivially normalised to $1$.
Secondly, we prove that the eigenfunctions of Eq. (S43) and the eigenvalues of Eq. (S44) satisfy the kernel eigenproblem of $K_{CN}$. For $k = 0$, a direct computation proves that $\Lambda_0$ and $\Phi_0$ satisfy the eigenproblem. For $k \neq 0$, the action of the kernel on $\Phi_k$ produces a sum of Kronecker deltas between wavevectors. When $k \in F_s$, the deltas coming from the terms with $j \in \mathcal{P}_{j,\pm}$ vanish, showing that the eigenproblem is satisfied with eigenvalue $\lambda_k/d$. When $k \in F_u$ with $u < s$, as the last $s - u$ components of $k$ vanish, there are several $q$'s satisfying the deltas in the bracket: $q = k$ itself, from the $l = j$ term, plus the $q$'s having a $u$-dimensional patch equal to $k^{(u)}_1$ and all the other elements set to zero, for a total of $(s - u + 1)$ such $q$'s. Moreover, as $\lambda_q$ depends only on the modulus of $q$, all these $q$'s result in the same eigenvalue, and in the same eigenfunction $\sum_l e^{iq\cdot x_l}/\sqrt{d}$ after the sum over patches.

Finally, we prove the expansion of the kernel in Eq. (S42). The terms on the RHS of Eq. (S51) are trivially equal to those of Eq. (S42) for $k \in F_s$. All the $k$'s having $s - u$ vanishing extremal components can be written as shifts of some $k^{(u)}_1 \in F_u$, which has the last $s - u$ components vanishing. But a shift of $k$ affects neither $\lambda_k$ nor $\sum_l e^{ik\cdot x_l}$, leading to a degeneracy of the eigenvalues indexed by $k$'s obtainable as shifts of some $k^{(u)}_1 \in F_u$. Such degeneracy is removed by restricting the sum over $k$ to the sets $F_u$, $u \leq s$, of wavevectors with nonvanishing extremal components, and rescaling the remaining eigenvalues by a factor $(s - u + 1)/d$, so that Eq. (S42) is obtained.
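The orthonormality argument above can be verified combinatorially: over the uniform torus, distinct integer-coefficient plane waves are orthonormal, so $\langle \Phi_k, \Phi_q \rangle$ simply counts coincidences between the coefficient vectors in $\mathbb{Z}^d$ generated by the shifted patches. A small sketch ($d$, $s$ and the wavevectors are our arbitrary choices):

```python
import numpy as np

d, s = 6, 3

def coeff_vectors(k):
    # integer coefficient vector in Z^d of exp(i k . x_l), one per patch start l
    vs = []
    for l in range(d):
        v = [0] * d
        for j, kj in enumerate(k):
            v[(l + j) % d] += kj
        vs.append(tuple(v))
    return vs

def inner(k, q):
    # <Phi_k, Phi_q> = (1/d) * #{(l, m) : coefficient vectors coincide}
    vk, vq = coeff_vectors(k), coeff_vectors(q)
    return sum(1 for v in vk for w in vq if v == w) / d

# wavevectors with nonvanishing extremal components (four in F_3, one in F_2)
ks = [(1, 2, 1), (2, 1, 1), (1, 0, 1), (1, 1, 2), (1, 1, 0)]
gram = np.array([[inner(k, q) for q in ks] for k in ks])
```

The resulting Gram matrix is exactly the identity, as the lemma asserts; repeating the computation with a shifted wavevector such as $(0, 1, 1)$ instead reproduces the degeneracy that the restriction to the $F_u$'s removes.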
Lemma C.2 (Spectra of overlapping local kernels). Let $K_{LC}$ be a local kernel defined as in Eq. (14b), with $\mathcal{P} = \{1, \dots, d\}$ and constituent kernel $C$ satisfying assumptions i) and ii) above. Then $K_{LC}$ admits the Mercer's decomposition of Eq. (S53), with eigenfunctions given in Eq. (S54) and eigenvalues given in Eq. (S55).

Proof. We start by proving the orthonormality of the eigenfunctions. The scalar product $\langle \Phi_{k,i}, \Phi_{q,j} \rangle$ depends on the relation between the $i$-th and $j$-th patches, cf. Eq. (S56).
Clearly, $\langle \Phi_0, \Phi_0 \rangle = 1$, and setting one of $q$ and $k$ to $0$ in Eq. (S56) yields orthogonality between $\Phi_0$ and $\Phi_{k,i}$ for all $k \neq 0$ and $i = 1, \dots, d$. For any $k \neq 0$ and $q \neq 0$, Eq. (S56d) implies that the scalar product can be non-zero only if $k^{(u)}_1 \in F_u$ and $q$ is a shift of $k^{(u)}_1$. But such a $q$ would have $q_1 = 0$, and there is no eigenfunction $\Phi_q$ with $q_1 = 0$ apart from $\Phi_0$. Hence, orthonormality is proven.
We then prove that the eigenfunctions and eigenvalues defined in Eq. (S54) and Eq. (S55) satisfy the kernel eigenproblem of $K_{LC}$. For $k = 0$, the claim follows directly from assumption i). For $k \in F_u$, with $u = 1, \dots, s$, the deltas which would set the first component of $k$ to $0$ are never satisfied. The second term in the brackets vanishes for $k \in F_s$, and the eigenvalue equation is satisfied. For $u < s$, as a result of the remaining deltas, the RHS of Eq. (S60) becomes a sum over all the $q$'s obtainable as shifts of $k^{(u)}_1$, which are $s - u + 1$ (including $k^{(u)}_1$ itself). The patch $x_i$ which is multiplied by $q$ in the exponent is also a shift of $x_l$, thus all the factors $e^{iq\cdot x_i}$ appearing in the sum coincide with $e^{ik^{(u)}_1 \cdot x_i}$. As $\lambda_q$ depends only on the modulus of $q$, all these terms correspond to the same eigenvalue $\lambda_{k^{(u)}_1}$.

Finally, we prove the expansion of the kernel in Eq. (S53). As in the proof of the eigendecomposition of convolutional kernels, all the $k$'s having $s - u$ vanishing extremal components can be written as shifts of some $k^{(u)}_1 \in F_u$, which has the last $s - u$ components vanishing. The shift of $k$ affects neither $\lambda_k$ nor the product $\phi_k(x_i)\overline{\phi_k(y_i)}$ after summing over $i$, leading to a degeneracy of eigenvalues which is removed by restricting the sum over $k$ to the sets $F_u$, $u \leq s$, and rescaling the remaining eigenvalues $\lambda_{k^{(u)}_1}$ accordingly.

D Proof of Theorem 4.1

Theorem D.1 (Theorem 4.1 in the main text). Let $K_T$ be a $d$-dimensional convolutional teacher kernel with a translationally-invariant $t$-dimensional constituent and leading nonanalyticity at the origin controlled by the exponent $\alpha_t > 0$. Let $K_S$ be a $d$-dimensional convolutional or local student kernel with a translationally-invariant $s$-dimensional constituent, and with a nonanalyticity at the origin controlled by the exponent $\alpha_s > 0$.
Assume, in addition, that if the kernels have overlapping patches then $s \geq t$, whereas if the kernels have nonoverlapping patches then $s$ is an integer multiple of $t$; and that the data are uniformly distributed on a $d$-dimensional torus. Then, the following asymptotic equivalence holds in the limit $P \to \infty$:
$$B(P) \sim P^{-\beta}, \qquad \beta = \alpha_t / s.$$
Proof. For the sake of clarity, we first give the proof in the nonoverlapping-patches case, and then extend it to the overlapping-patches case. Since $K_T$ and $K_S$ have translationally-invariant constituent kernels and the data are uniformly distributed on a $d$-dimensional torus, both kernels can be diagonalised in Fourier space.

Let us start by considering a convolutional student. Because of the constituent kernel's isotropy, the Fourier coefficients $\Lambda^{(s)}_k$ of $K_S$ depend only on $k$ (the modulus of $k$); notice the superscript indicating the dimensionality of the student constituent. In particular, $\Lambda^{(s)}_k$ is a decreasing function of $k$ and, for large $k$, $\Lambda^{(s)}_k \sim k^{-(s+\alpha_s)}$. Then $B(P)$ reads as a sum over the wavevectors with modulus larger than $k_c(P)$, where $k_c(P)$ is defined as the wavevector modulus of the $P$-th largest eigenvalue and $\mathbb{E}[|c_k|^2]$ denotes the variance of the target coefficients in the student eigenbasis. $k_c(P)$ is such that there are exactly $P$ eigenvalues with $k \leq k_c(P)$, i.e. $k_c(P) \sim P^{1/s}$.

Denoting the eigenfunctions of the student with $\Phi^{(s)}_k$, the superscript $(s)$ indicating the dimension of the constituent's plane waves, and decomposing the teacher kernel $K_T$ into its eigenvalues $\Lambda^{(t)}_q$ and eigenfunctions, one can compute $\mathbb{E}[|c_k|^2]$ explicitly. The $q = 0$ mode of the teacher can give non-vanishing contributions to $c_0$ only, therefore it does not enter any term of the sum in Eq. (S64). Once the term with $q = 0$ is removed, consider the $y$-integral: $\mathbb{E}[|c_k|^2]$ is non-zero only for the $k$'s which have at most $t$ consecutive nonvanishing components, with the remaining $s - t$ components strictly zero. Inserting Eq. (S71) into Eq. (S64) yields Eq. (S72), hence $B(P) \sim k_c(P)^{-\alpha_t} \sim P^{-\alpha_t/s}$.

When using a local student, the convolutional eigenfunctions in the RHS of Eq. (S66) are replaced by the local eigenfunctions $\Phi_{k,i}(x)$ of Eq. (18). Repeating the same computations, one finds the analogous estimate of Eq. (S74).

As we showed in Appendix C, when the patches overlap the set of wavevectors which index the eigenvalues is restricted from $\mathbb{Z}^s$ to the union of the $F_u$'s for $u = 0, \dots, s$.
In addition, the eigenvalues with $k \in F_u$, $0 < u < s$, are rescaled by a factor $(s - u + 1)/d$. Therefore, in the overlapping case the eigenvalues do not decrease monotonically with $k$, and $B(P)$ cannot be written as a sum over the $k$'s with modulus larger than a certain threshold $k_c$. Considering also that, with $t \leq s$, $\mathbb{E}[|c_k|^2]$ is non-zero only for the $k$'s having at most $t$ consecutive nonvanishing components, $B(P)$ becomes a sum weighted by the indicator function $\chi(\Lambda^{(s)}_k < \Lambda_P)$, where $\Lambda_P$ denotes the $P$-th largest eigenvalue, ensuring that the sum runs over all but the first $P$ eigenvalues of the student. The sets $\{F_u\}_{u<t}$ have null measure in $\mathbb{Z}^t$, whereas $F_t$ is dense in $\mathbb{Z}^t$, thus the asymptotics of $B(P)$ are dictated by the sum over $F_t$. When the $k$'s are restricted to the latter set, the eigenvalues are again decreasing functions of $k$, and the constraint $\Lambda^{(s)}_k < \Lambda_P$ translates into $k > k_c(P)$. Having changed, with respect to the nonoverlapping case, only an infinitesimal fraction of the eigenvalues, the asymptotic scaling of $k_c(P)$ with $P$ remains unaltered, and the estimates of Eq. (S72) and Eq. (S74) extend to kernels with overlapping patches after substituting the degeneracy $d/s$ with $|\mathcal{P}| = d$.
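The two counting arguments in the proof, $k_c(P) \sim P^{1/s}$ from the student spectrum and $B(P)$ dominated by the $t$-dimensional sublattice of teacher modes beyond $k_c$, can be reproduced numerically. A rough sketch with $s = 3$, $t = 2$, $\alpha_t = 1$ (our arbitrary choices; the predicted slope is $-\alpha_t/s = -1/3$):

```python
import numpy as np

s, t, alpha_t = 3, 2, 1.0

# student side: k_c(P) is the radius enclosing P lattice points of Z^s,
# i.e. the modulus of the P-th largest eigenvalue  =>  k_c(P) ~ P^{1/s}
g = np.arange(-30, 31)
K3 = np.sqrt(sum(a ** 2 for a in np.meshgrid(g, g, g, indexing="ij")))
radii = np.sort(K3.ravel())

def k_c(P):
    return radii[P]

# teacher side: B(P) ~ sum over t-dimensional wavevectors with |q| > k_c(P)
# of the teacher spectrum |q|^{-(t + alpha_t)}
r = np.arange(-1000, 1001)
K2 = np.sqrt(r[:, None] ** 2 + r[None, :] ** 2)

def B(P):
    return (K2[K2 > k_c(P)] ** -(t + alpha_t)).sum()

Ps = np.array([2000, 4000, 8000, 16000])
slope_k = np.polyfit(np.log(Ps), np.log([k_c(P) for P in Ps]), 1)[0]  # ~ 1/s
slope_B = np.polyfit(np.log(Ps), np.log([B(P) for P in Ps]), 1)[0]    # ~ -alpha_t/s
```

Despite the lattice discreteness, the fitted slopes are close to the predicted $1/s$ and $-\alpha_t/s$.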

E Asymptotic learning curves with a local teacher
Theorem E.1. Let $K_T$ be a $d$-dimensional local teacher kernel with a translationally-invariant $t$-dimensional constituent and leading nonanalyticity at the origin controlled by the exponent $\alpha_t > 0$. Let $K_S$ be a $d$-dimensional local student kernel with a translationally-invariant $s$-dimensional constituent, and with a nonanalyticity at the origin controlled by the exponent $\alpha_s > 0$. Assume, in addition, that if the kernels have overlapping patches then $s \geq t$, whereas if the kernels have nonoverlapping patches then $s$ is an integer multiple of $t$; and that the data are uniformly distributed on a $d$-dimensional torus. Then, the following asymptotic equivalence holds in the limit $P \to \infty$:
$$B(P) \sim P^{-\beta}, \qquad \beta = \alpha_t / s.$$

Proof. The proof is analogous to that of Appendix D, the only difference being that eigenfunctions and eigenvalues are indexed by both the wavevector $k$ and the patch index $i$. This results in an additional factor of $d/s$ in the RHS of Eq. (S65). All the discussion between Eq. (S66) and Eq. (S71) can be repeated by attaching the additional patch index $i$ to all the coefficients. Eq. (S72) for $B(P)$ is then corrected with an additional sum over patches. The extra sum, however, does not influence the asymptotic scaling with $P$.
F Proof of Theorem 6.1

Theorem F.1 (Theorem 6.1 in the main text). Consider a positive-definite kernel $K$ with eigenvalues $\Lambda_\rho$ such that $\sum_\rho \Lambda_\rho < \infty$, and eigenfunctions $\Phi_\rho$, learning a (random) target function $f^*$ in kernel ridge regression (Eq. (3)) with ridge $\lambda$ from $P$ observations $f^*_\mu = f^*(x^\mu)$, with the $x^\mu \in \mathbb{R}^d$ drawn from a certain probability distribution. Let us denote with $D_T(\Lambda)$ the reduced density of kernel eigenvalues with respect to the target and with $\epsilon(\lambda, P)$ the generalisation error, and assume that:

i) for any $P$-tuple of indices $\rho_1, \dots, \rho_P$, the vector $(\Phi_{\rho_1}(x^1), \dots, \Phi_{\rho_P}(x^P))$ is a Gaussian random vector;

ii) the target function can be written in the kernel eigenbasis with coefficients $c_\rho$, and, asymptotically for small $\Lambda$, $D_T(\Lambda) \sim \Lambda^{-(1+r)}$ and $c^2(\Lambda) \sim \Lambda^q$, with $r > 0$ and $r < q < r + 2$.

Then the following equivalence holds in the joint $P \to \infty$ and $\lambda \to 0$ limit with $1/(\lambda\sqrt{P}) \to 0$:
$$\epsilon(\lambda, P) \sim \lambda^{q - r}.$$

Proof. In this proof we make use of results derived in [21]. Our setup for kernel ridge regression corresponds to what the authors of [21] call the classical setting. Let us introduce the integral operator $T_K$ associated with the kernel, defined by
$$(T_K f)(x) = \int p(\mathrm{d}y)\, K(x, y)\, f(y).$$
The trace $\mathrm{Tr}[T_K]$ coincides with the sum of $K$'s eigenvalues and is finite by hypothesis. We define the following estimator of the generalisation error $\epsilon(\lambda, P)$,
$$R(\lambda, P) = \left\| \left( \mathrm{Id} - A_{\vartheta} \right) f^* \right\|^2 = \sum_\rho c_\rho^2 \left( \frac{\vartheta(\lambda)}{\Lambda_\rho + \vartheta(\lambda)} \right)^2,$$
where $\vartheta(\lambda)$ is the signal capture threshold (SCT) [21] and $A_\vartheta = T_K(T_K + \vartheta(\lambda))^{-1}$ is a reconstruction operator [21]. The target function can be written in the kernel eigenbasis by hypothesis (with coefficients $c_\rho$), and $T_K$ has the same eigenvalues and eigenfunctions as the kernel by definition. Hence,
$$R(\lambda, P) = \int_0^{+\infty} \mathrm{d}\Lambda\, D_T(\Lambda)\, c^2(\Lambda) \left( \frac{\vartheta(\lambda)}{\Lambda + \vartheta(\lambda)} \right)^2,$$
where $D_T$ is the reduced density of eigenvalues of Eq. (25). We now derive the asymptotics of $R(\lambda, P)$ in the joint $P \to \infty$ and $\lambda \to 0$ limit, then relate the asymptotics of $R$ to those of $\epsilon(\lambda, P)$ via a theorem proven in [21]. Proposition 3 of [21] shows that, for any $\lambda > 0$, $\partial_\lambda \vartheta(\lambda) \to 1$ and $\vartheta(\lambda) \to \lambda$ as $P \to \infty$, with corrections of order $1/P$.
Thus, we focus on the integral
$$R(\lambda) = \int_0^{+\infty} \mathrm{d}\Lambda\, D_T(\Lambda)\, c^2(\Lambda) \left( \frac{\lambda}{\Lambda + \lambda} \right)^2.$$
The functions $D_T(\Lambda)$ and $c^2(\Lambda)$ can be safely replaced with their small-$\Lambda$ expansions $\Lambda^{-(1+r)}$ and $\Lambda^q$ over the whole range of the integral because of the assumptions $q > r$ and $q < r + 2$. In practice, there should be an upper cut-off on the integral coinciding with the largest eigenvalue $\Lambda_1$, but the assumption $q < r + 2$ makes this part of the spectrum irrelevant for the asymptotics of the error; in fact, we will conclude that the integral is dominated by the portion of the domain around $\lambda$. After the change of variables $y = \Lambda/\lambda$,
$$R(\lambda) \sim \lambda^{q - r} \int_0^{+\infty} \mathrm{d}y\, \frac{y^{q - r - 1}}{(1 + y)^2},$$
where one recognises one of the integral representations of the Beta function. Therefore,
$$R(\lambda) \sim \lambda^{q - r}\, \frac{\Gamma(q - r)\, \Gamma(2 - q + r)}{\Gamma(2)}, \qquad \mathrm{(S86)}$$
with $\Gamma$ denoting the Gamma function. It is interesting to notice how the assumptions $q > r$ and $q < r + 2$ are required in order to avoid the poles of the $\Gamma$ functions in the RHS of Eq. (S86).
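Both the $\lambda^{q-r}$ scaling and the Beta-function constant can be checked by direct numerical integration; a sketch with $q - r = 0.8$ (an arbitrary admissible value in $(0, 2)$):

```python
import numpy as np
from math import gamma

p = 0.8  # q - r; must lie in (0, 2) to avoid the poles of the Gamma functions

def R(lam, n=200_000):
    # integral of Lambda^{p-1} * lam^2 / (lam + Lambda)^2 over a wide log-grid
    L = np.logspace(-12, 4, n)
    f = L ** (p - 1) * lam ** 2 / (lam + L) ** 2
    # trapezoidal rule in log-space: dLambda = Lambda * d(log Lambda)
    x = np.log(L)
    return 0.5 * ((f * L)[1:] + (f * L)[:-1]).dot(np.diff(x))

beta_const = gamma(p) * gamma(2 - p) / gamma(2)
r1, r2 = R(1e-3), R(1e-2)
```

Dividing out $\lambda^{q-r}$ recovers $\Gamma(q-r)\Gamma(2-q+r)$, and the ratio of $R$ at two ridges recovers the exponent $q - r$.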
Remark. The estimate for the exponent $\beta$ of Corollary 6.1.1 follows from the theorem above with $r = t/(s + \alpha_s)$, $q = (\alpha_t + t)/(\alpha_s + s)$ and $\lambda \sim P^{-\gamma}$. Then $q > r$ because $\alpha_t > 0$, whereas the condition $q < r + 2$ is equivalent to the assumption $\alpha_t < 2(\alpha_s + s)$, required in Section 4 in order to derive the learning curve exponent in Eq. (20) from our estimate of $B(P)$.

G Numerical experiments

G.1 Details on the simulations
To obtain the empirical learning curves, we generate $P + P_{\text{test}}$ random points, uniformly distributed in a $d$-dimensional hypercube or on the surface of a $(d-1)$-dimensional hypersphere embedded in $d$ dimensions, if not specified otherwise. We use $P \in \{128, 256, 512, 1024, 2048, 4096, 8192\}$ and $P_{\text{test}} = 8192$. For each value of $P$, we generate a Gaussian random field with covariance given by the teacher kernel, and we compute the kernel ridgeless regression predictor of the student kernel using Eq. (4) with the $P$ training samples. The generalisation error defined in Eq. (5) is approximated by computing the empirical mean squared error on the $P_{\text{test}}$ unseen samples. The expectation with respect to the target function is obtained by averaging over 128 independent teacher Gaussian processes, each sampled on different points of the domain. As teacher and student kernels, we consider different combinations of the convolutional and local kernels defined in Eq. (14a) and Eq. (14b), with Laplacian constituents $C(x_i - y_j) = e^{-\|x_i - y_j\|}$ and overlapping patches. In particular:

• the cases with a convolutional teacher and both convolutional and local students with various filter sizes are reported in Fig. 1 and Fig. S3, for data distributed in a hypercube and on a hypersphere respectively;

• the cases with a local teacher and both local and convolutional students are reported in Fig. S2, for data distributed in a hypercube.
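The pipeline just described can be reproduced in miniature. The sketch below is our own one-dimensional toy version (a plain Laplacian kernel, teacher equal to student, small $P$), not the paper's experimental code; for matching teacher and student in $d = 1$, the test error is expected to fall roughly as $P^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def gram(X, Y):
    # Laplacian kernel C(x, y) = exp(-|x - y|) on 1d inputs
    return np.exp(-np.abs(X[:, None] - Y[None, :]))

def avg_test_error(P, n_test=512, n_targets=8):
    errs = []
    for _ in range(n_targets):
        X = rng.uniform(0.0, 1.0, P + n_test)
        # target: Gaussian random field with the teacher kernel as covariance
        K = gram(X, X) + 1e-10 * np.eye(P + n_test)
        f = np.linalg.cholesky(K) @ rng.standard_normal(P + n_test)
        Xtr, Xte, ytr, yte = X[:P], X[P:], f[:P], f[P:]
        # (almost) ridgeless kernel regression with the student kernel
        alpha = np.linalg.solve(gram(Xtr, Xtr) + 1e-9 * np.eye(P), ytr)
        errs.append(np.mean((gram(Xte, Xtr) @ alpha - yte) ** 2))
    return float(np.mean(errs))

errors = [avg_test_error(P) for P in (64, 128, 256, 512)]
```

Averaging over a handful of target draws is enough to see the error drop by roughly a factor of the $P$-ratio over this range.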
Experiments are run on NVIDIA Tesla V100 GPUs using the PyTorch package. The approximate total amount of time to reproduce all experiments with our setup is 400 hours. Code for reproducing the experiments can be found at https://github.com/fran-cagnetta/local_kernels.

G.2 Additional experiments
Convolutional vs local students In Fig. S1 we report the empirical learning curves for convolutional and local student kernels learning a convolutional teacher kernel, with filter sizes s and t respectively. Data are uniformly sampled in the hypercube [0, 1] d . By rescaling the sample complexity P of the local students with the number of patches |P| = d, the learning curves of local and convolutional students overlap, confirming our prediction on the role of shift-invariance. Indeed, the local student has to learn the same local task at all the possible patch locations, while the convolutional student is naturally shift-invariant.
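The role of shift-invariance can be made concrete with a toy implementation of the two kernels (our sketch; normalisation conventions may differ from Eq. (14a)-(14b)): translating the input leaves the convolutional kernel unchanged, because it only permutes the set of patches, while the local kernel is sensitive to the translation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, s = 8, 3

def patches(x):
    # overlapping, 'wrapped' s-dimensional patches x_i = (x_i, ..., x_{i+s-1})
    return np.stack([np.roll(x, -i)[:s] for i in range(d)])

def constituent(u, v):
    # Laplacian constituent C(u, v) = exp(-|u - v|)
    return np.exp(-np.linalg.norm(u - v))

def k_cn(x, y):
    # convolutional kernel: average over all pairs of patches
    return np.mean([[constituent(u, v) for v in patches(y)] for u in patches(x)])

def k_lc(x, y):
    # local kernel: average over corresponding patches only
    return np.mean([constituent(u, v) for u, v in zip(patches(x), patches(y))])

x, y = rng.standard_normal(d), rng.standard_normal(d)
x_shifted = np.roll(x, 2)

cn_gap = abs(k_cn(x_shifted, y) - k_cn(x, y))  # exactly zero (patch set permuted)
lc_gap = abs(k_lc(x_shifted, y) - k_lc(x, y))  # generically nonzero
```

This is exactly why the local student must re-learn the task at every patch location, costing the factor $|\mathcal{P}| = d$ in sample complexity observed in Fig. S1.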
Local teacher In Fig. S2 we report the empirical learning curves for a local teacher kernel and data uniformly sampled in the hypercube $[0,1]^d$. In panels I and J the student is also a local kernel, and the same discussion of Section 5 applies. In panel K, the student is a convolutional kernel and the generalisation error does not decrease as the size of the training set increases. Indeed, a local, non-shift-invariant function is not in the span of the eigenfunctions of a convolutional kernel, therefore the student is not able to learn the target.
Spherical data In Fig. S3 we report the empirical learning curves for a convolutional teacher and convolutional (left panels) and local (right panels) student kernels. Data are restricted to the unit sphere $\mathbb{S}^{d-1}$. Panels L-O are the analogues of panels A-D of Fig. 1. Notice that when the filter size of the student coincides with $d$ (panels P, Q), the learning curves decay with exponent $\beta = 1/(d-1)$ (instead of $\beta = 1/d$). Indeed, for data normalised on $\mathbb{S}^{d-1}$, the spectrum of the Laplacian kernel decays at a rate $O(k^{-\alpha-(d-1)})$ with $\alpha = 1$. However, as the student filter size is lowered, we recover the exponent $1/s$ independently of the dimension $d$ of input space, as derived for data on the torus and shown empirically for data in the hypercube. In fact, we expect the $s$-dimensional marginals of the uniform distribution on $\mathbb{S}^{d-1}$ to become insensitive to the spherical constraint when $s \ll d$.
Convolutional NTKs In Fig. S4 we report the empirical learning curves obtained using the NTK of one-hidden-layer CNNs with ReLU activations, which corresponds to using the kernel $\Theta_{FC}$ defined in Eq. (S14) as the constituent. Since this kernel is not translationally invariant, it cannot be diagonalised in the Fourier domain, and the analysis of Section 4 does not apply. However, as shown in panels P-S, the same learning curve exponents $\beta$ as in the Laplacian-constituent case are recovered. Indeed, $\Theta_{FC}$ and the Laplacian kernel share the same nonanalytic behaviour at the origin, and their spectra have the same asymptotic decay [32]. In Fig. S5 we present the same plots as in panels R and S, but instead of the analytical NTKs, we compute numerically the kernels of randomly-initialised very wide CNNs ($H \approx 10^6$).
Real data In Fig. S6 we report the learning curves of local kernels with Laplacian constituents applied to the CIFAR-10 dataset. We build the tasks by selecting two classes and assigning label $+1$ to data from one class and $-1$ to data from the other. As before, we use $P \in \{128, 256, 512, 1024, 2048, 4096, 8192\}$ and $P_{\text{test}} = 8192$. In contrast with our assumptions, image data are strongly anisotropic, and the distance between nearest-neighbour points decays faster than $P^{-1/d}$. Indeed, target functions defined on data of this kind are usually not cursed by the full dimensionality $d$ of the inputs, but rather by an effective dimension $d_{\text{eff}}$, related to the dimension of the manifold on which the data lie [4], which may also vary when extracting patches of different sizes. Nonetheless, as in our synthetic setup, the learning curve exponent $\beta$ increases monotonically as the filter size of the kernel is reduced, reinforcing the idea that leveraging locality is key for performance.