Gradient-Based Empirical Risk Minimization using Local Polynomial Regression

In this paper, we consider the problem of empirical risk minimization (ERM) of smooth, strongly convex loss functions using iterative gradient-based methods. A major goal of this literature has been to compare different algorithms, such as gradient descent (GD) or stochastic gradient descent (SGD), by analyzing their rates of convergence to $\epsilon$-approximate solutions. For example, the oracle complexity of GD is $O(n\log(\epsilon^{-1}))$, where $n$ is the number of training samples. When $n$ is large, this can be expensive in practice, and SGD is preferred due to its oracle complexity of $O(\epsilon^{-1})$. Such standard analyses only utilize the smoothness of the loss function in the parameter being optimized. In contrast, we demonstrate that when the loss function is smooth in the data, we can learn the oracle at every iteration and beat the oracle complexities of both GD and SGD in important regimes. Specifically, at every iteration, our proposed algorithm performs local polynomial regression to learn the gradient of the loss function, and then estimates the true gradient of the ERM objective function. We establish that the oracle complexity of our algorithm scales like $\tilde{O}((p \epsilon^{-1})^{d/(2\eta)})$ (neglecting sub-dominant factors), where $d$ and $p$ are the data and parameter space dimensions, respectively, and the gradient of the loss function belongs to a $\eta$-H\"{o}lder class with respect to the data. Our proof extends the analysis of local polynomial regression in non-parametric statistics to provide interpolation guarantees in multivariate settings, and also exploits tools from the inexact GD literature. Unlike GD and SGD, the complexity of our method depends on $d$ and $p$. However, when $d$ is small and the loss function exhibits modest smoothness in the data, our algorithm beats GD and SGD in oracle complexity for a very broad range of $p$ and $\epsilon$.

(GD) [38], stochastic gradient descent (SGD) [36], and their refinements, e.g., [14,26,39]. A significant portion of this literature is concerned with analyzing the convergence rates of different iterative algorithms and determining which ones have smaller oracle complexities, i.e., smaller number of gradient queries required to obtain approximate minimizers, under various levels of convexity, smoothness, sampling, and other assumptions (see, e.g., [6,40] and the references therein). We focus on the widely studied setting of strongly convex loss functions. In this case, as depicted in Table 1, the (first order) oracle complexity of GD is known to be O(n log(ǫ −1 )) [38,Theorem 2.1.15], where n is the number of training samples in the ERM objective function, and ǫ > 0 is the desired approximation accuracy of the solution. In contrast, it is well-known that SGD has an oracle complexity of O(ǫ −1 ) [36]. Since n is often extremely large in modern instances of ERM, SGD is theoretically preferable to GD. This is corroborated by the ubiquitous use of SGD and its variants for ERM in practice. Our goal is to improve upon the oracle complexities of GD, SGD, and their variants by exploiting additional information about the loss functions that has heretofore been neglected in the literature.
Recall that the general unconstrained ERM problem is usually abstracted as, cf. [6, Section 1], [21]: where each f i : R p → R represents the loss function corresponding to a different sample of training data. (Note that this formulation, modulo the extra scaling n −1 , is also utilized in distributed optimization, where each f i represents the local objective function corresponding to an individual agent in a network of n agents [34].) As suggested by the form of (1), typical analysis of iterative gradient-based algorithms, such as GD or SGD, makes no assumptions about f i as a function of i. However, the loss function often varies quite smoothly with the data in ERM. For example, given training data (x (i) , y i ) ∈ R p × R : i ∈ [n] (with [n] {1, . . . , n}), the classical least squares formulation of linear regression with regularization is (see, e.g., [21,Chapter 3]): where the objective is to find the optimal coefficients θ ∈ R p , and R(θ) is some regularization term, e.g., ridge or lasso. Letting f (x, y; θ) = y − θ T x 2 + R(θ) for x, θ ∈ R p and y ∈ R, we see that each f i (θ) = f (x (i) , y i ; θ) in formulation (1) for this problem. Clearly, f is very smooth as a function of the data (x, y), regardless of the regularization R(θ). This observation that loss functions are often smooth in the data holds in more complex learning scenarios too. For instance, when training a deep neural network with a smooth loss, e.g., cross-entropy loss, and smooth activation functions, e.g., logistic (sigmoid) functions, the induced loss function for the ERM objective function is infinitely differentiable in both the data and the network's weight parameters [5,Chapter 5]. Since such smoothness in the data has not been exploited by modern analysis of iterative gradient-based optimization algorithms for (1) (see Section 1.1), the main thrust of this paper is to speed up algorithms for ERM using this additional smoothness in the data. To this end, in the remainder of this work, we consider the following problem formulation of ERM. For d, n ∈ N, assume that we are given n samples of d-dimensional training data D = {x (i) ∈ [h ′ , 1 − h ′ ] d : i ∈ [n]}, where h ′ > 0 is an arbitrarily small parameter (to be determined later; see (19)), and each training sample belongs to the hypercube [h ′ , 1 − h ′ ] d without loss of generality. (This compactness assumption affords us some analytical tractability in the sequel.) Note that when d ≥ 2, the first d − 1 elements of a sample can be perceived as a (d − 1)-dimensional feature vector, and the last element of the sample can be construed as the corresponding continuous-valued label (as in the linear regression example above). For a given p ∈ N, fix the (parametrized) loss function f : [0, 1] d × R p → R, f (x; θ), where θ = (θ 1 , . . . , θ p ) ∈ R p is a p-dimensional parameter vector. Now Table 1 Oracle complexities of GD, SGD, and our algorithm for strongly convex loss functions: Here, ǫ > 0 is the approximation accuracy, p ∈ N is the dimension of the parameter space, d ∈ N is the dimension of the data space, n ∈ N is the number of training data samples, η > 0 is the Hölder exponent determining the smoothness of the loss function in the data space, and C(d, η) = C µ,L 1 ,L 2 ,b,c (d, l) is given in (28).

Algorithm
Oracle complexity GD O(n log(ǫ −1 )) [38] SGD O(ǫ −1 ) [36] LPI-GD O(C(d, η)(pǫ −1 ) d/(2η) log(pǫ −1 )) ( Theorem 2) define the ERM objective function F : R p → R as follows: which is the empirical risk, i.e., the empirical expectation of the loss function with respect to the training data D. With this objective function, we consider the unconstrained minimization problem: where F * ∈ R denotes the minimum value. In contrast to the canonical formulation (1), our formulation in (3) and (4) explicitly states the dependence of the loss function on the data. Under the smoothness and strong convexity assumptions delineated in Section 1.3, we make the following main contributions: 1. We propose a new inexact gradient descent algorithm to compute approximate solutions of (4) in Section 2.2.1 (see Algorithm 1). The main innovation of this algorithm is to use the smoothness of the loss function f in the data to learn the gradient oracle of f at every iteration by performing local polynomial regression based on oracle queries over a set of virtual data points. 2. We derive the iteration and oracle complexities of this algorithm in Proposition 2 and Theorem 2, respectively, in Section 2.2.2. In particular, we show that the oracle complexity scales likeÕ((pǫ −1 ) d/(2η) ) (neglecting sub-dominant factors) as indicated in the last row of Table 1, where η > 0 denotes the Hölder exponent determining the smoothness of the gradient of the loss function with respect to the data (see Section 1.3). Although we focus on strongly convex loss functions to leverage simple convergence results on inexact gradient descent from the literature, analogous analysis holds for larger classes of loss functions, e.g., when f (x; ·) : R p → R is non-convex. 3. Furthermore, we show in Proposition 3 in Section 2.2.2 that if the data dimension is sufficiently small, e.g., d = O(log log(n)) (as in healthcare data analytics or control applications; see Section 1.1), and the loss function f exhibits modest smoothness in the data, i.e., η = Θ(d), then the oracle complexity for our method beats the oracle complexities (or more precisely, the oracle complexity bounds) of both GD and SGD for any p that grows polynomially in n and any ǫ that decays polynomially in n −1 . This demonstrates that smoothness of loss functions in the data can be successfully exploited in gradient-based algorithms for ERM to significantly lower their oracle complexity. 4. Finally, in order to analyze the convergence rate of our algorithm, we also generalize the technique of local polynomial regression, or more precisely, local polynomial interpolation (since there is no noise in our problem), in non-parametric statistics to the multivariate setting in Theorem 1 in Section 2.1. Although it is implied in many standard expositions that univariate local polynomial regression can be generalized to multiple variables (see, e.g., [17], [44], or [46]), the multivariate case requires significantly more care than the univariate case. Consequently, a detailed and rigorous analysis of local polynomial interpolation is difficult to find in the literature. Therefore, we believe that Theorem 1 may be of independent interest in statistics.
Let us briefly outline the rest of this paper. We discuss some related literature in Section 1.1 that further distinguishes our contributions from known results, and conveys that our results apply to practical machine learning settings. Then, we introduce required notation and definitions, formally state key assumptions, and briefly elaborate on the standard framework of oracle complexity in optimization theory in Sections 1.2, 1.3, and 1.4, respectively. In Section 2, we present our main results as enumerated above. We prove our results on local polynomial regression in Section 3, and carry out the convergence analysis of our algorithm in Section 4. Finally, we conclude our discussion and propose future research directions in Section 5.

Related Literature
There is an enormous literature concerning the analysis of convergence rates of exact and inexact gradient-based algorithms that perform the optimization in (1). We do not attempt to survey this entire field here, and instead only mention some noteworthy works. For example, in our strongly convex setting of interest, the convergence analysis of GD can be found in the standard text [38], and corresponding convergence analysis of SGD can be found in [36] (also see [40] and the references therein). Moreover, various refinements and improvements of these basic iterative algorithms have been analyzed in the literature. Notable examples include the famous accelerated (or momentum based) GD [39], mini-batch SGD methods [14], and variance reduced SGD methods [26] (also see the references therein). We also refer readers to [6] for a unified treatment of many of the aforementioned approaches. More generally, inexact gradient descent methods for ERM have been analyzed extensively as well-see, e.g., [20,43] and the references therein for work on the strongly convex case. (Note that SGD can be perceived as an example of an inexact GD method, which uses unbiased estimates of the true gradient of the ERM objective function at every iteration.) This paper and most of the aforementioned references focus on obtaining upper bounds on oracle complexity. Conversely, several authors have also established fundamental lower bounds, cf. [1,7] and the references therein. As mentioned earlier, to our knowledge, all of this literature concerns the formulation in (1). Hence, these methods do not exploit smoothness of loss functions in data by learning gradients in every iteration as in our proposed method.
On the other hand, motivated by the utility of gradient information for various tasks like variable selection and determining coordinate covariations, cf. [31,32], the problem of learning gradients in the context of supervised learning problems, such as classification and regression, has received a lot of attention from the machine learning community. Specifically, most of this literature aims to simultaneously learn a classification or regression function along with its gradient from training data. (Again, we only mention some noteworthy works here.) For example, the text [17] presents the classical theory of gradient estimation using local polynomial fitting, and [13] presents a gradient estimation approach that uses local polynomial regression, but does not estimate the regression function. In a related vein, [15] studies kernel methods for gradient estimation. Various other methods for learning gradients have also been analyzed, such as reproducing kernel Hilbert space methods [31,32], regression splines [47], and nearest neighbor methods [3]. However, the vast majority of these techniques have not been considered in the context of optimization for machine learning.
Two recent exceptions where gradient estimation has been applied to optimization are [45] and [3]. In particular, the authors of [45] illustrate that learning gradients leads to desirable convergence rates for derivative-free optimization algorithms, which only use zeroth order oracles (i.e., these oracles only output function values, not gradient values). Similarly, [3] portrays the effectiveness of its nearest neighbor based gradient estimator for derivative-free optimization. It is worth noting that both these works analyze optimization of a single objective function (rather than a sum as in (1)). Hence, the results derived in these papers are not particularly insightful in our ERM setting in (3) and (4), where we seek to learn gradients based on a first order oracle by exploiting the smoothness of f (x; θ) in the data x.
Lastly, as we remarked earlier, we establish in Proposition 3 that the oracle complexity of our algorithm beats GD and SGD for a broad range of values of p and ǫ, if the data dimension d is sufficiently small, e.g., d = O(log log(n)). Since we utilize multivariate local polynomial regression in our algorithm, a "small d assumption" is theoretically unavoidable due to the curse of dimensionality, cf. [17,Section 7.1]. We close this discussion of related literature by further explaining why this assumption is also very reasonable in applications. Indeed, many application domains have problems where the number of data attributes is small. For example, in the healthcare data analytics space, electronic health records often have very few features per patient, e.g., temperature, blood pressure, initial diagnosis, and medications, but machine learning techniques are still widely used to build models for diagnosing diseases or predicting risk (cf. [9] and its references). As another example, in the problem of system identification for robotics and control applications, neural networks have been used to build non-linear state-space models (cf. [33,35] and their references). The number of inputs of these networks is often very small when the model order is known or assumed to be low. On the other hand, in many applications with high-dimensional "raw data," data scientists believe that the useful information in the data lives in a low-dimensional manifold. This belief is the basis of a myriad of dimensionality reduction techniques, such as principal component analysis [23,41], canonical correlation analysis [24], Laplacian eigenmaps [4] or diffusion maps [11], and modal decompositions (or correspondence analysis), cf. [25], [28,Chapter 4], and the references therein. As a result, many high-dimensional data analysis pipelines proceed by first pre-processing the data using pertinent dimensionality reduction techniques, and then using the resulting low-dimensional features for other learning tasks. In such high-dimensional settings, our contributions in this paper pertain to ERM using the extracted low-dimensional features.

Notation
We briefly introduce some notation that is used throughout this paper. Let N {1, 2, 3, . . . } and Z + {0, 1, 2, 3, . . . } denote the sets of natural numbers and non-negative integers, respectively. As mentioned earlier, for any n ∈ N, let [n] {1, . . . , n} be the set of positive integers less than n + 1. In the sequel, we analyze a d-variate problem setting with d ∈ N. So, we briefly introduce the standard multi-index notation. For any d-tuple of non-negative integers s = (s 1 , . . . , s d ) ∈ Z d + , we let |s| = s 1 + · · ·+ s d , s! = s 1 ! · · · s d !, and x s = x s1 1 · · · x s d d for x ∈ R d . Moreover, given a continuously differentiable function f : 1 · · · ∂x s d d ) to denote its sth partial derivative, and ∇ x f = ∂f ∂x1 · · · ∂f ∂x d T to denote its gradient. We will sometimes find it convenient to index certain vectors or matrices using the set {s ∈ Z d + : |s| ≤ l}. In these cases, to help with calculations, we will assume that these vectors or matrices are written out with the indices in lexicographical order (i.e., for any r, s ∈ Z d + with r = s, we say that r < s if r j < s j for the first position j ∈ [d] where r j = s j ).
Furthermore, for any vector x ∈ R n with n ∈ N, we let x i be the ith entry of x, where i ∈ [n] (or more generally, i belongs to an appropriate index set), and we let x p be the usual ℓ p -norm of x with p ∈ [1, +∞]. Similarly, for any matrix A ∈ R m×n with m, n ∈ N, we let A i,j be the (i, j)th element of A, where i ∈ [m] and j ∈ [n] (or more generally, i and j belong to appropriate index sets), and we let A F be the Frobenius norm of A. In addition, A T represents the transpose of A, and when m = n, A −1 represents the inverse of A if it is invertible, tr(A) represents the trace of A, det(A) represents the determinant of A, and λ min (A) represents the minimum eigenvalue of A. We also let I n be the n × n identity matrix. For any pair of symmetric matrices A, B ∈ R n×n , we use to denote the Löwner partial order and write A B to mean that A − B is positive semidefinite.
Finally, we introduce some useful miscellaneous notation. We let ⌈·⌉ denote the ceiling function, ½{·} denote the indicator function that equals 1 if its input proposition is true and equals 0 otherwise, exp(·) and log(·) denote the natural exponential and logarithm functions with base e, respectively, and E[·] denote the expectation operator, where the underlying probability measure can be easily inferred from context. Throughout this paper, we will also utilize standard Bachmann-Landau asymptotic notation, e.g., O(·), Θ(·), ω(·), etc., where the underlying variable that tends to infinity will be clear from context. Moreover, we will let poly(n) and polylog(n) denote polynomial and poly-logarithmic scaling in the variable n, respectively.

Smoothness Assumptions and Approximate Solutions
In order to perform tractable convergence analysis of gradient-based iterative algorithms that solve (4), we must formally define the problem class by specifying any smoothness conditions on the loss function, the appropriate notion of "approximate solutions," and the oracle model of computational complexity used [38, Section 1.1]. This section and Section 1.4 precisely define our problem class.
Recall our formulation where we are given n samples of d-dimensional training data D and a loss function f : [0, 1] d × R p → R, f (x, θ), and we consider the ERM problem in (3) and (4). Let ǫ > 0 be any small constant representing the desired level of approximation accuracy. We begin by imposing the following assumptions on our loss function: 1. Smoothness in parameter: There exists L 1 > 0 such that for all fixed x ∈ [0, 1] d , the gradient of f with respect to θ, denoted ∇ θ f (x; ·) : R p → R p , exists and is L 1 -Lipschitz continuous as a function of the parameter vector: 2. Smoothness in data: There exist η > 0 and L 2 > 0 such that for all fixed ϑ ∈ R p and for all i ∈ [p], the ith partial derivative of f (with respect to θ i ) at ϑ, denoted g i (·; ϑ) 3. Strong convexity: There exists µ > 0 such that for all fixed x ∈ [0, 1] d , the map f (x; ·) : R p → R is µ-strongly convex as a function of the parameter vector: Moreover, as additional notation, we let σ L 1 /µ represent the condition number of the problem.
Under these assumptions, we define an ǫ-approximate solution to the optimization problem (4) as any parameter vector θ * ∈ R p that satisfies We make two pertinent remarks at this point. Firstly, it can be argued based on the above strong convexity assumption (as shown in Section 4.1) that F * is finite and achievable by a unique global minimizer of F ; in particular, the infimum in (4) may be replaced by a minimum. Secondly, both the first and third assumptions are standard in the optimization literature, cf. [6,38]. As noted earlier, our main idea is the observation that the second assumption, that gradients of loss functions are also smooth in the data, holds in many machine learning applications. For example, in the regularized linear regression example in (2), ∇ θ f is clearly smooth with respect to the data (assuming R is differentiable).

Oracle Complexity
Finally, we briefly introduce the formalism of "first order oracle complexity" upon which our analysis will be based. The notion of oracle complexity was first introduced in [37] to circumvent the difficulties faced by vanilla computational complexity theory style analysis of convex programming (cf. [1, Section 1]). As elucidated in [38, Section 1.1], although "arithmetical complexity," which accounts for the running time of all basic arithmetic operations executed by an algorithm, is a more realistic measure of computational complexity, the idea of oracle (or "analytical") complexity, i.e., the total number of oracle calls made by an algorithm, is a more convenient abstraction for the analysis of iterative optimization methods. Indeed, the running time of oracle queries is often the main bottleneck in arithmetical complexity.
Observe that any iterative algorithm that approximately solves optimization problems belonging to the class of ERM problems defined in (3) and (4), under the assumptions in Section 1.3, requires two inputs: a) the training data D, and b) a "description" of the specific loss function f under consideration. To fulfill the latter requirement, we will assume that such algorithms have access to a local first order oracle O : which returns the value of the gradient ∇ θ f with respect to θ for any input query pair (x, ϑ). (Note that sometimes the oracle is defined to produce values of both the loss function f and its gradient ) This oracle O can be construed as a black box "description" of the gradient of the loss function f with respect to θ. Given any particular iterative algorithm, A = A(D, O), that solves our class of ERM problems, its first order oracle complexity (or just oracle complexity) Γ (A) is formally defined as the minimum number of oracle calls made by the algorithm A to obtain an ǫ-approximate solution (in the sense of (8)) [38, Section 1.1]. Thus, Γ (A) is typically a function of the notable problem variables n, d, p, η, ǫ, and the fixed constants L 1 , L 2 , µ. In this paper, following the paradigm of the optimization literature, we will focus on determining the first order oracle complexity of our proposed optimization method.

Main Results and Discussion
In this section, we first present our generalization of local polynomial regression to the multivariate case in Section 2.1, and then describe our proposed algorithm and explain its convergence analysis in Section 2.2.

Local Polynomial Regression
We commence by introducing the essentials of local polynomial regression specialized to our noiseless setting. Since most references on the subject only present the univariate (i.e., d = 1) framework, e.g., [17], [44, Section 1.6], we rigorously carry out the much more involved calculations for the multivariate case by carefully generalizing the univariate development in [44, Section 1.6]. We also refer readers to [10, Section 2] for classical references and a brief history of smoothing using local regression.
Consider a function g : [0, 1] d → R that belongs to the (η, L 2 )-Hölder class, namely, g is l = ⌈η⌉ − 1 times differentiable and satisfies (6) for every s ∈ Z d + with |s| = l. (Proposition 4 in Appendix A conveys some intuition behind what this condition means.) In order to present the local polynomial regression based approximation of this function, we first state some required definitions and notation. Let which has cardinality |G m | = m d . Finally, define the vector(-valued function) U : R d → R D as: (via the hockey-stick identity). Then, for any arbitrarily small value of the bandwidth parameter h ∈ 0, 1 2 , we can write the following weighted regression problem: As we shall see,φ provides a "good" approximation of the function g in the supremum (or Chebyshev) norm.
For any which is also positive semidefinite due to the non-negativity of the kernel. We next present a key technical proposition which generalizes [44, Lemma 1.5] to the d-variate setting and demonstrates that B(x) is actually positive definite.
Then, for all m ∈ N and where the constant Λ(d, l) > 0 is defined as Proposition 1 is established in Section 3.1 by exploiting several basic ideas from the theory of orthogonal polynomials and matrix perturbation theory. The proof is much more sophisticated than [44, Lemma 1.5] because explicit quantitative estimates are derived to deal with general d. Since Proposition 1 shows that B(x) is invertible, a straightforward exercise in calculus conveys that the unique solutionΦ(x) of weighted regression problem in (12) can be written as: where the optimal vector-valued weights w y (x) ∈ R D are given by: Then, defining the first coordinates of these vectors as the real-valued interpolation weights: the local polynomial interpolator of order l for g : [h, 1 − h] d → R can be written as the following weighted sum of the interpolation points {g(y) : y ∈ G m }: In Section 3.2, we present some important properties of the interpolation weights given in (17) based on the key estimate in Proposition 1.
The ensuing theorem conveys that local polynomial regression can be used to uniformly approximate a function g (in the supremum norm sense) for sufficiently large uniform grids G m .
Theorem 1 (Supremum Norm Interpolation) Fix any constant δ ∈ (0, 1), and consider the function g : where c ≥ b > 0 are the bounds on our kernel, and Λ(d, l) is defined in (14). Then, the supremum norm approximation error of g is bounded by δ, viz.
whereφ is the local polynomial interpolator of order l for g given in (18), and the bandwidth h ∈ 0, 1 2 is chosen to be Theorem 1 is proved in Section 3.3 using the pertinent properties of the interpolation weights in (17) derived in Section 3.2. We note that the lower bound on m is expected to scale at least exponentially with the dimension d due to the curse of dimensionality, cf. [17, Section 7.1]. However, it may be possible to improve the dependence we have above using a different analysis. In the sequel, we will exploit Theorem 1 to perform inexact gradient descent.

Inexact Gradient Descent using Local Polynomial Interpolation
We now turn to describing and analyzing a new inexact gradient descent algorithm that performs local polynomial interpolation at every iteration. In particular, we delineate the algorithm in Section 2.2.1, and derive convergence rates in Section 2.2.2.

Description of Algorithm
Under the setup and assumptions described in Section 1, suppose we seek to solve the optimization problem (4) in the ǫ-approximate sense of (8) based on some training data D with the parameter h ′ > 0 given by which is the upper bound on the bandwidth h in Theorem 1. 1 We first present our proposed algorithm, dubbed local polynomial interpolation based gradient descent (LPI-GD), in general, and then specialize the values of different problem variables with judicious choices later on. Let θ (t) ∈ R p denote the "approximate solution" produced by LPI-GD at iteration t ∈ Z + , and let δ t ∈ (0, 1) denote the maximum allowable supremum norm approximation error at iteration t ∈ N. Fix any bounded kernel K : with c ≥ b > 0 as in Section 2.1, and corresponding to each δ t at iteration t ∈ N, consider the uniform grid G mt , where m t ∈ N is given by Theorem 1: and the bandwidth h t ∈ 0, 1 2 is also given by Theorem 1: We initialize θ (0) ∈ R p to be any arbitrary vector in R p at iteration t = 0. Then, at iteration t ∈ N, we make |G mt | = m d t first order oracle calls at all virtual data points in the uniform grid G mt with the approximate solution at iteration t − 1: For every i ∈ [p], these oracle calls give us information about the partial derivatives {g i (y; θ (t−1) ) : y ∈ G mt } with respect to θ i . Hence, akin to (12) and (18) in Section 2.1, we can construct the local polynomial interpolatorφ where the interpolation weights w * y,t (x) ∈ R are defined analogously to (17); in particular, they are the first elements of the weight vectors w y,t (x) ∈ R D , which are defined as in (16), but with G m replaced by the grid G mt and h replaced by the bandwidth h t . Using the learned interpolators in (23) at iteration t, we next approximate the true gradient of the ERM objective function ∇ θ F (θ (t−1) ) at θ (t−1) as follows. For every i ∈ [p] and every j ∈ [n], we first estimate the partial derivative Finally, we perform the inexact gradient descent update which produces the approximate solution for iteration t. We run the above procedure for T ∈ N iterations, and the LPI-GD algorithm returns θ (T ) ∈ R p as the ǫ-approximate solution for the optimization problem (4). The analysis in Section 2.2.2 elucidates the specific values taken by the constants δ t and the total number of iterations T in terms of the other problem parameters. In particular, we will see that the approximation errors δ t , grid sizes m t , bandwidths h t , and uniform grids G mt all do not change over iterations. A pseudocode summary of the LPI-GD algorithm with pertinent δ t and T (selected as in Section 2.2.2) is presented in Algorithm 1; the algorithm uses the training data D and a first order oracle O to generate an ǫ-approximate solution that solves (4). We remark that Algorithm 1 portrays a 'theoretical' algorithm, because it assumes that problem parameters, such as the Lipschitz constant L 1 , the strong convexity parameter µ, the minimum value F * , and the Hölder exponent η and constant L 2 , are known to the practitioner. When some or all of these parameters are unknown, we can use appropriate bounds and estimates based on the loss function f : [0, 1] d × R p → R. For example, we may use any upper bound on L 1 , any lower bound on µ, and any universal lower bound on f (which yields a lower bound on F * ). Moreover, for smooth functions f , we can choose any desired level of differentiability η and any upper bound on L 2 . When such bounds or estimates are also unavailable, the best we can do is to select T , δ t , m t , and h t to have the correct scaling with respect to ǫ, d, p, and η. It is worth mentioning that our objective in this paper is to demonstrate that the oracle complexities of GD and SGD can be theoretically beaten by the proposed algorithm. So, we do not delve further into considerations regarding its practical implementation here.
Before proceeding to our analysis, it is worth juxtaposing Algorithm 1 with standard batch GD and SGD. At each iteration, GD exactly computes gradients for all data samples using n oracle calls. Since this can be computationally expensive, inexact gradient descent methods attempt to reduce the number of oracle calls per iteration at the cost of increasing the number of iterations. For example, SGD makes only one oracle call per iteration for a randomly chosen data sample, and uses this queried gradient value as a proxy for the true gradient. In contrast, at each iteration, our approach evaluates the gradients at certain representative virtual points in the data space, and uses these oracle calls to learn a model for the gradient oracle. This learnt model is then used to approximate the true gradient. Intuitively, when the loss function is a sufficiently smooth function of the data, we can learn an accurate model with a reasonably small number of virtual points.

Analysis of Algorithm
To compute the oracle complexity of the LPI-GD algorithm, we must first compute the minimum number of iterations it takes to obtain an ǫ-approximate solution to (4). The ensuing proposition chooses appropriate values of δ t , which turn out to be constant with respect to t, and develops an upper bound on the minimum number of iterations T .

16:
Update: where F : R p → R is the ERM objective function in (3), and σ = L 1 /µ > 1 is the condition number defined in Section 1.3. Then, after T iterations of the LPI-GD algorithm described in Algorithm 1 in Section 2.2.1 with arbitrary initialization θ (0) ∈ R p and updates (25), the parameter vector θ (T ) ∈ R p is an ǫ-approximate solution to the optimization problem in (4) in the sense of (8): Proposition 2 is proved in Section 4.1 using Theorem 1 and some known ideas from the analysis of inexact gradient descent methods in [20]. Using Proposition 2, the ensuing theorem bounds the first order oracle complexity of the LPI-GD algorithm in Algorithm 1, denoted Γ (LPI-GD), which is the minimum number of oracle calls made by Algorithm 1 to obtain an ǫ-approximate solution (as defined in Section 1.4).
Theorem 2 is established in Section 4.2. Unlike the dimension-free oracle complexity bounds for batch GD and SGD, our bound in Theorem 2 depends on d and p. However, even though it may not be immediately obvious, the oracle complexity of LPI-GD scales much better than the oracle complexities of GD or SGD in many important regimes. We next elucidate a scaling law for the bound in Theorem 2 in the setting where n, d, p, η, ǫ −1 → ∞ and all other problem parameters are constant.
For comparison, recall that the first order oracle complexity of GD (with appropriately chosen constant step size) is bounded by [38,Theorem 2.1.15]: when the ERM objective function in (3) is strongly convex and has Lipschitz continuous gradient.
(These conditions follow from the assumptions in Section 1.3 as shown in the proof of Proposition 2 in Section 4.1. Note that each iteration of GD makes n first order oracle queries, and [38, Theorem 2.1.15] conveys that it takes O(log(ǫ −1 )) iterations for GD to produce an ǫ-approximate solution in the sense of (8).) The expression in (29) does not depend on d, p, η, and neglects all other constant parameters. Similarly, when the ERM objective function in (3) is strongly convex, has Lipschitz continuous gradient, and certain universal second moment bounds are satisfied, the first order oracle complexity of SGD (with linearly diminishing step sizes) is bounded by [36]: where the notion of ǫ-approximate solution in (8) is defined in expectation because the iterates of SGD are random variables. For simplicity, we will assume that the quantities d, p, η, ǫ −1 are all functions of the number of training samples n, and study the asymptotics as n → ∞. The ensuing proposition presents a scaling regime for LPI-GD which beats the oracle complexities (or more precisely, the oracle complexity bounds) of both GD and SGD shown above.

This implies that
Proposition 3 is proved in Section 4.3. Under its assumptions, the LPI-GD algorithm beats any Θ(poly(n)) oracle complexity when τ is chosen to be large enough. Hence, one can also calculate interesting regimes where LPI-GD scales better than more refined gradient-based algorithms such as accelerated GD [39], mini-batch SGD [14], variance reduced SGD [26], etc. We do not explicitly carry out these calculations for brevity. We close this section by making several remarks. Firstly, many of the refined gradient-based methods mentioned above lead to improvements in the dependence of oracle complexity bounds on the condition number. We do not consider or optimize the trade-off with condition number in our analysis; this is in many ways an orthogonal question. As indicated at the outset of this paper, our focus has been on beating GD and SGD's oracle complexities with regard to their dependence on n, ǫ, d, and p.
Secondly, while methods like SGD or mini-batch SGD improve upon GD's oracle complexity (in terms of its trade-off between n and ǫ), they are stochastic algorithms that only generate ǫapproximate solutions on expectation. In contrast, the LPI-GD algorithm can beat GD and is completely deterministic.
Thirdly, since the LPI-GD algorithm has to learn the oracle at every iteration, the computational (or "arithmetical") complexity of its iterations may seem to be higher than usual iterative methods. However, the motivation for focusing on the concept of oracle complexity in optimization theory is partly that computing gradients can be extremely hard for certain functions. So, the computational complexity of the oracle usually dominates any other reasonable computation done during iterations. We, therefore, stick to this canonical paradigm in this work [38]. If the precise running time of the first order oracle O is known, then the complete computational complexity of the LPI-GD algorithm can be derived from Proposition 2 and Theorem 2 by accounting for the running time of the D × D matrix inversions performed at every iteration.
Fourthly, we emphasize that Proposition 3 holds in the important regime where α ≤ 1 and β > 1. The former condition ensures that SGD is favorable to GD (since lim n→∞ Γ * (SGD)/Γ * (GD) = 0), which is often the case in practice. The latter condition ensures that the number of parameters p can be much large than the number of training samples n, which corresponds to the popular overparametrized regime considered in the context of deep neural networks, cf. [42].
Finally, we note that the dependence of the bounds in Theorem 2 and Proposition 3 on both p and d could potentially be improved. The dependence on p arises because Theorem 1 holds for functions with codomain R. However, the proof of Proposition 2 in Section 4.1 needs to bound the ℓ 2 -norm error between (24) and the true gradient ∇ θ F (θ (t−1) ) for t ∈ N. Translating the coordinatewise guarantee to an ℓ 2 -norm guarantee in this proof introduces the dependence on p (see (60)). If some version of Theorem 1 could be proved for functions with codomain R p , then we could possibly improve the dependence of Theorem 2 on p.
On the other hand, the dependence on d could be improved by tightening the minimum eigenvalue lower bound in Lemma 1 as explained in Section 3.1. Nevertheless, this eigenvalue lower bound must decay at least exponentially (and correspondingly, the grid size parameter m must grow at least exponentially) in d, as noted after Theorem 1 in Section 2.1. Thus, it is intuitively straightforward to see that the best scaling of d we can hope for, so that the conclusion of Proposition 3 remains true, is d = O(log(n)). This logarithmic scaling of data dimension d with number of training samples n is significant. Indeed, for many data science applications of interest, the useful information in a dataset of n high-dimensional feature vectors is contained within the pairwise (Euclidean) distances between different feature vectors. In these scenarios, standard low-dimensional approximate isometric embedding theorems, e.g., the Johnson-Lindenstrauss lemma (cf. [12] and references therein), ensure that the feature vectors can be mapped to a d = O(log(n)) dimensional space such that the pairwise distances are preserved. We leave the development of our results along these directions as future work.

Proofs for Local Polynomial Interpolation
In this section, we provide proofs for the results presented in Section 2.1. Specifically, we first establish Proposition 1, then portray some properties of the interpolation weights in (17), and finally prove Theorem 1.

Proof of Proposition 1
To prove Proposition 1, we require the following intermediate lemma which generalizes [44, Lemma 1.4] to our d-variate setting.

Lemma 1 (Auxiliary Löwner Lower Bound) When
where U : R d → R D is defined in (11), and the constant Λ(d, l) > 0 is defined in (14).
Proof We begin by establishing some simple facts about B. Following the proof of [44,Lemma 1.4] mutatis mutandis, note that B is positive semidefinite:

Suppose for the sake of contradiction that there exists a non-zero vector
Then, we must have v T U (u) = 0 almost everywhere (with respect to the Lebesgue measure) on [−1, 1] d . However, since the non-zero polynomial u → v T U (u) is analytic, its set of roots has Lebesgue measure zero (see, e.g., [27,Section 4.1]). This is a contradiction. So, we actually have v T Bv > 0 for all non-zero vectors v ∈ R D . Thus, B is positive definite. From hereon, our proof deviates greatly from that of [ where the second equality uses (31) where L k is a polynomial with degree k. 2 It is well-known that the family of Legendre polynomials satisfies the orthonormality property [8, Chapter I, Exercise 1.6]: Moreover, for each k ∈ Z + , the leading coefficient (of t k ) of L k (t) is known to be [8, Chapter V, Equation (2.7)]: Next, for any r ∈ Z d + , define the Legendre product polynomial : which has degree r (with abuse of notation). It is straightforward to verify that these product polynomials also satisfy an orthonormality condition: where we utilize (34). Moreover, for each r ∈ Z d + , the leading coefficient (of u r ) of L r (u) is known to be d j=1 where we utilize (35). Due to (37), we know that for any r ∈ Z d + , the set {L s : s ∈ Z d + , s ≤ r} is linearly independent and spans the set of all d-variate polynomials with degree at most r. (Note that we use the lexicographical order over indices in Z d + here. Furthermore, it is straightforward to verify that all d-variate monomials with non-zero coefficients that appear in L s have degree at most s.) As a result, defining the vector(-valued) function L : R d → R D as: which is indexed in lexicographical order, we have the linear relation: for some fixed matrix R ∈ R D×D (that does not depend on u). It is evident that this matrix R is lower triangular due to the lexicographical order we have imposed. Furthermore, it is easy to derive the diagonal entries of R. Indeed, for any index s ∈ Z d + with |s| ≤ l, equating the leading coefficients on both sides of (40) produces 2s j + 1 2 2s j s j using (38). Rearranging this, we get: Now notice that where the second equality follows from (40), and the third equality follows from (37). We recognize (42) where the second equality follows from (41), the third inequality follows from the following bound (on Catalan numbers; see [16,Theorem]): the fifth equality follows from adding the exponents |s| of 2 over all s ∈ Z d + with |s| ≤ l and then squaring, the sixth inequality follows from the Stirling's formula bound (see, e.g., [18, Chapter II, Section 9, Equation (9.15)]): 3 the seventh inequality follows from the straightforward bound: which uses the fact that l ≥ d, the eighth inequality uses the following basic facts and calculations: 1. The hypercube maximizes the volume over all d-orthotopes with an isoperimetric constraint: 4 max s1,...,s d ≥0: where j ∈ Z + and the maximum is achieved by s 1 = · · · = s d = j d ; 2. For any j ∈ Z + and any s ∈ Z d + with |s| = j, the vector (j, 0, . . . , 0) ∈ Z d + majorizes the vector s, and since the map 0 ≤ t → 2t log(t + 1) is convex, Karamata's majorization inequality yields (cf. [29, Chapter 1, using the monotonicity of the exponential function; and the tenth inequality follows from the simple bound: which uses the fact that l ≥ d.
Finally, recall the following well-known result, cf. [30, Lemma 1]: which utilizes the positive definiteness of B established earlier. 5 Combining (32), (43), and (44), we obtain where the third inequality holds because d ≥ 19, the map 1 ≤ t → t t is strictly increasing, and we utilize the ensuing standard bounds on binomial coefficients: and the final equality follows from (14). Since B λ min (B)I D , applying the lower bound in (45) completes the proof.

⊓ ⊔
We make three pertinent remarks at this point. Firstly, we note that when d ≤ l ≤ γd for some constant γ > 1, one can verify that Hence, Λ(d, l) −1 scales double exponentially in d: This observation will be used in the proof of Proposition 3 in Section 4.3. Secondly, we note that the (crude) scaling with d of the lower bound on λ min (B) deduced in (47) can also be derived by an alternative known approach. Define the constant C ((2l + 1)!) d (l!) 2 , and notice that CB is an non-negative integer-valued matrix due to (31). Indeed, for any r, s ∈ Z d + with |r|, |s| ≤ l, we have CB r,s = ½{r + s is even entry-wise} 2 d l(l − 1) · · · (|r| + 1)l(l − 1) · · · (|s| + 1) which belongs to Z + , because |r|!/r! and |s|!/s! are multinomial coefficients, and the denominator divides the numerator in each term of the product. Since CB is both positive definite and nonnegative integer-valued, its determinant is a positive integer using the Leibniz formula [ As in the proof of Lemma 1, combining (32), (44), and (49), we obtain Letting l = Θ(d) and applying Stirling's approximation to the above factorial term, one can verify that the lower bound in (50) also scales double exponentially in d. Despite the simplicity of this argument, we prefer the analysis of det(B) in the proof of Lemma 1, because it exactly evaluates det(B) before lower bounding it, thereby illustrating why the derived scaling of det(B) with d and l is not loose. Moreover, the lower bound on λ min (B) in (45) is generally tighter than that in (50). Thirdly, we also conjecture that the scaling in (47) can be significantly improved to something like O(d Θ(poly(d)) ). Indeed, using the Courant-Fischer-Weyl min-max theorem [22,Theorem 4.2.6] (also see the Schur-Horn theorem [22,Theorem 4.3.45]), the minimum eigenvalue of B is upper bounded by the minimum diagonal entry of B as follows: where the second inequality follows from setting s 1 = l and s 2 = · · · = s d = 0 (which is reasonable because it asymptotically achieves the minimum), and the third inequality follows from the Stirling's formula bound (cf. [18, Chapter II, Section 9, Equation (9.15)]). Thus, λ min (B) −1 has the potential to scale as πl 2l+2 when l = Θ(poly(d)). We believe that this scaling is likely to be close to the actual scaling of λ min (B) −1 . Since we exactly evaluate the trace and determinant in the proof of Lemma 1, the looseness of our minimum eigenvalue lower bound in Lemma 1 stems from the AM-GM based result in (44). Hence, a fruitful future direction would be to obtain tighter bounds on the minimum eigenvalue in terms of trace and determinant that are still sufficiently tractable so as to permit analytical evaluation. We now establish Proposition 1 using Lemma 1.
Proof (of Proposition 1) We take inspiration from the proof technique of [44, Lemma 1.5], but the details of our analysis are more involved and quite different. Observe that for all m ∈ N, for all where the first equality follows from (13), the second equality follows from the substitution y = x+hz and the fact that the kernel K has support [−1, 1], the third inequality holds because the kernel K is lower bounded by b > 0 on [−1, 1], and we defineB(x) as the D × D symmetric positive semidefinite matrixB Then, applying the Courant-Fischer-Weyl min-max theorem [22, Theorem 4.2.6], we obtain We proceed to lower bounding λ min B (x) . Fix any x ∈ [h, 1 − h] d , and any r, s ∈ Z d + with |r|, |s| ≤ l, and consider the Riemann sum Therefore, we have shown that lim m→∞B (x) r,s = B r,s when h = ω(m −1 ). We will require an explicit upper bound on the rate of this convergence in the sequel. To this end, let us first compute the Lipschitz constant of γ with respect to the ℓ ∞ -norm. For any i ∈ [d] and any fixed (u 1 , . . . , Hence, via repeated application of the triangle inequality, we have where the second inequality follows from the triangle inequality, the fourth inequality follows from (54), the fifth inequality holds because |{z ∈ 1] can have at most 2mh+1 points at distance (mh) −1 from each other), and the sixth inequality holds because we assume that m ∈ N and h ∈ 0, 1 2 are such that mh ≥ 4l(3e) d /Λ(d, l) ≥ 1. This implies that for all such m and h, where the fourth equality follows from the multinomial theorem, and the fifth inequality uses the Maclaurin series of the exponential function. Next, recall the following consequence of the Wielandt-Hoffman-Mirsky inequality, cf. [22,Corollary 6.3.8]: Then, combining (55) and (56) produces the lower bound and employing Lemma 1 to this bound yields where Λ(d, l) is defined in (14). Finally, since m and h satisfy mh ≥ 4l(3e) d /Λ(d, l), or equivalently, using (53) and (57), we get Since B(x) λ min (B(x))I D , the above lower bound on λ min (B(x)) completes the proof. ⊓ ⊔

Properties of Local Polynomial Interpolation Weights
To prove Theorem 1, we will require two useful lemmata regarding the interpolation weights in (17). The first lemma generalizes [44,Proposition 1.12] to the d-variate setting, and demonstrates the intuitively pleasing property that polynomials are reproduced exactly by the local polynomial interpolator given in (18). where the interpolation weights are given in (17).
Proof We follow the proof of [44, Proposition 1.12] mutatis mutandis. Fix any m ∈ N and x ∈ [h, 1 − h] d such that B(x) is positive definite, and define the vector q(x) ∈ R D as Then, for every y ∈ G m , we have where the first equality uses Taylor's theorem (cf. [2,Theorem 12.14]) and the fact that g is a polynomial with total degree at most l, and U is defined in (11). This implies that the solution to the weighted regression problem in (12) satisfieŝ where the second equality follows from (13), and the third equality holds because B(x) is positive definite. Therefore, by considering the first coordinate, we get where the first equality uses (18). This completes the proof.

⊓ ⊔
The second lemma generalizes [44,Lemma 1.3] to the d-variate setting, and portrays an ℓ 1 -norm bound on the sequence of interpolation weights in (17).
where c ≥ b > 0 are the bounds on our kernel, and Λ(d, l) is defined in (14).
Proof Once again, we follow the proof of [ where the first inequality follows from (17) and the Cauchy-Schwarz inequality, the second equality follows from (16), the third inequality holds because the kernel K is upper bounded by c > 0 on the support [−1, 1], the fourth inequality follows from Proposition 1, the fifth inequality follows from the bound which uses (11), the fact that |y j − x j | ≤ h for all j ∈ [d], the multinomial theorem, and the Maclaurin series of the exponential function, the sixth inequality holds because the interval [−h, h] can have at most 2mh + 1 points at distance m −1 from each other, and the seventh inequality follows from the assumption that mh ≥ 4l(3e) d /Λ(d, l) ≥ 1. This completes the proof. ⊓ ⊔

Proof of Theorem 1
Finally, we are in a position to prove the supremum norm interpolation guarantee in Theorem 1.
Proof (of Theorem 1) Suppose that m ∈ N and h ∈ 0, 1 2 are such that mh ≥ 4l(3e) d /Λ(d, l). Then, observe that for every where the first equality follows from (18), the second equality holds because y∈Gm w * y (x) = 1 which follows from Lemma 2 applied to the constant unit polynomial, the third equality follows from Taylor's theorem with Lagrange remainder term [2, Theorem 12.14]: which holds for some τ ∈ (0, 1), and the fourth equality holds because y∈Gm (y − x) s w * y (x) = 0 for all s ∈ Z d + with 1 ≤ |s| ≤ l, which follows from Lemma 2 applied to the polynomials u → (u−x) s . This implies that where the first inequality follows from the triangle inequality, the second inequality follows from the proof of Proposition 4 in Appendix A, the third equality follows from (16) and (17) since the kernel K has support [−1, 1], the fourth inequality follows from Lemma 3, and the fifth equality holds because we set which minimizes the upper bound in the fourth inequality and achieves equality in the condition imposed on m and h at the outset of the proof. Lastly, to ensure that sup x∈[h, which is equivalent to which in turn is implied by the condition where we use the facts that η > l ≥ d ≥ 19 (by assumption) and Λ(d, l) (η+1)/η ≥ Λ(d, l) 2 (since Λ(d, l) ≥ 1), and we slacken the Λ(d, l) term because this does not change its double exponential nature. Moreover, we may further simplify (59) by noting that where we use the Stirling's formula bound (cf. [ where θ (t) for t ∈ Z + are the inexact gradient descent updates given by (25) with an arbitrary for t ∈ N are our local polynomial interpolation based approximations of the true gradients ∇ θ F (θ (t−1) ) shown in (24), and σ = L 1 /µ ≥ 1 is the condition number defined in Section 1.3.
We remark that [20, Lemma 2.1] additionally assumes that the infimum F * in (4) can be achieved. However, this condition need not be imposed in Lemma 4, since if F is strongly convex on R m , its infimum F * is finite and achieved by some unique global minimizer θ * ∈ R p . We next prove Proposition 2 using Lemma 4 and Theorem 1.
Proof (of Proposition 2) We first verify that the conditions of Lemma 4 hold. Recall the ERM objective function F : R p → R given in (3): , F is continuously differentiable and ∇ θ F : R p → R p is also L 1 -Lipschitz continuous. Indeed, observe that for all θ 1 , θ 2 ∈ R p , where we use the triangle inequality. Next, we apply Lemma 4 and get where θ (t) for t ∈ Z + are our updates in (25), and ∇F (t) for t ∈ N are our approximations of the true gradients in (24). To upper bound the squared ℓ 2 -norms on the right hand side of this inequality, notice that for any t ∈ N, we have where the first inequality follows from (24) and the triangle inequality,φ (t) i (x) denotes the output of our local polynomial interpolator for ∂f ∂θi (·; θ (t−1) ) when evaluated at x (cf. (23)), and the third inequality follows from Theorem 1 and our choices of grid sizes (20) and bandwidths (21) in Section 2.2.1 (where (19) ensures that h t ≤ h ′ ). Hence, we obtain We now select appropriate values for the constants {δ t ∈ (0, 1) : t ∈ N}. For every t ∈ N, choose as indicated in (27). This produces where the last equality computes the value of the previous geometric series. Finally, to ensure that θ (T ) is an ǫ-approximate solution of (4), viz.
F (θ (T ) ) − F * ≤ ǫ as defined in (8), it suffices to impose the condition By taking logarithms of both sides, and rearranging and simplifying the resulting inequality, we obtain the equivalent inequality where we also use the fact that σ = L 1 /µ. Therefore, after running the LPI-GD algorithm for T ∈ N iterations with T given by (26), the updated parameter vector θ (T ) forms an ǫ-approximate solution to (4). We also note that for every t ∈ N, by substituting (26) into our choice of δ t above. This yields (27), and completes the proof. ⊓ ⊔

Proof of Theorem 2
We next establish Theorem 2 using Proposition 2.
Proof (of Theorem 2) Since we choose the maximum allowable approximation errors {δ t ∈ (0, 1) : t ∈ [T ]} to be constant with respect to t as in (27) , where the second inequality follows from (20) and (27) in Proposition 2, the third inequality follows from (26) in Proposition 2, the fourth inequality follows from straightforward manipulations and the fact that σ = L 1 /µ, and the fifth inequality holds because we have assumed that η > d. The LPI-GD algorithm makes |G m | oracle calls in each of its T iterations and produces an ǫ-approximate solution θ (T ) ∈ R p (in the sense of (8)) as argued in Proposition 2. Hence, the oracle complexity of this algorithm is upper bounded by which use the fact that τ > max{1, α −1 }. This completes the proof. ⊓ ⊔

Conclusion
We conclude by recapitulating our main contributions and reiterating some directions of future research. In order to exploit the smoothness of loss functions in data for optimization purposes, we proposed the LIP-GD algorithm (see Algorithm 1) to compute approximate solutions of the ERM problem in (3) and (4), where the loss function is strongly convex and smooth in both the parameter and the data. We then derived the iteration and oracle complexities of this algorithm, and illustrated that its oracle complexity beats the oracle complexities of both GD and SGD for a very broad range of dependencies between the parameter dimension p, the approximation accuracy ǫ, and the number of training samples n, when the data dimension d is sufficiently small. Finally, in the course of our convergence analysis, we also provided a careful and detailed analysis of multivariate local polynomial interpolation with supremum norm guarantees, which may be of independent interest in non-parametric statistics. We close our discussion by stating three avenues for future work. Firstly, as stated at the end of Section 2 and in Section 3.1, it would be very valuable to improve the dependence of Theorem 2 on the data dimension d by sharpening the minimum eigenvalue lower bound in the key technical estimate in Lemma 1. Secondly, the dependence of Theorem 2 on the parameter dimension p could be improved by establishing new interpolation guarantees for approximating smooth R p -valued functions using local polynomial regression or other methods. Lastly, as noted in Section 1, while our work focuses on the strongly convex loss function setting, similar analyses could be carried out for other conventional scenarios, e.g., when f (x; ·) : R p → R is convex, or when f (x; ·) : R p → R is non-convex and ǫ-approximate solutions to (4) are defined via approximate first order stationary points.

A Taylor Approximation of Hölder Class Functions
In this appendix, we present a useful upper bound on the Taylor approximation error for functions residing in Hölder classes. While such results are known in the literature (see, e.g., [46,ter "Density Estimation" Equation (12), Chapter "Nonparametric Regression" Equation (2)]), the ensuing proposition is adapted to our specific setting. where l = ⌈η⌉ − 1, and we utilize multi-index notation.
Proof By Taylor where the second inequality follows from the triangle inequality, the third inequality follows from the Hölder class assumption (see (6)) and the fact that τ ∈ (0, 1), and the last equality follows from the multinomial theorem. This completes the proof. ⊓ ⊔