Deep ReLU networks and high-order finite element methods II: Chebyshev emulation

We show expression rates and stability in Sobolev norms of deep feedforward ReLU neural networks (NNs) in terms of the number of parameters defining the NN for continuous, piecewise polynomial functions, on arbitrary, finite partitions $\mathcal{T}$ of a bounded interval $(a,b)$. Novel constructions of ReLU NN surrogates encoding function approximations in terms of Chebyshev polynomial expansion coefficients are developed which require fewer neurons than previous constructions. Chebyshev coefficients can be computed easily from the values of the function in the Clenshaw--Curtis points using the inverse fast Fourier transform. Bounds on expression rates and stability are obtained that are superior to those of constructions based on ReLU NN emulations of monomials as considered in [Opschoor, Petersen and Schwab, 2020] and [Montanelli, Yang and Du, 2021]. All emulation bounds are explicit in terms of the (arbitrary) partition of the interval, the target emulation accuracy and the polynomial degree in each element of the partition. ReLU NN emulation error estimates are provided for various classes of functions and norms, commonly encountered in numerical analysis. In particular, we show exponential ReLU emulation rate bounds for analytic functions with point singularities and develop an interface between Chebfun approximations and constructive ReLU NN emulations.


Introduction
The use of Deep Neural Networks (DNNs for short) as approximation methods for the numerical solution of Partial Differential Equations (PDEs for short) has received considerable attention in recent years. We mention only the "PINN" and "Deep Ritz" methodologies and their variants.
One question which arises in this context is the DNN "expressivity", i.e., the approximation power of DNNs with a specific architecture (comprising, e.g., depth, width and activation function, but no restriction on connectivity) for particular sets of functions which arise as solution components ("solution features" in the parlance of deep learning) of PDEs in science and engineering.
In recent years, numerous results on analytic and approximation properties of deep ReLU NNs have appeared. We only provide an incomplete review, with particular attention to results of direct relevance to the present paper. Previous works focused on particular classes of univariate functions, e.g. [1]. It has also been observed that DNNs, in particular the presently considered ReLU-activated DNNs, can emulate large classes of well-established approximation architectures, such as refinable functions (wavelets) (see, e.g., [4] and the references there), splines and high-order polynomial functions [21]. See also the survey [5] and the references there. Recently, also rather wide classes of Finite-Element approximation spaces on regular, simplicial partitions of polytopal domains in $\mathbb{R}^d$ have been emulated by ReLU DNNs [16]. Further results, not of direct relevance here, on approximation rate bounds for deep ReLU NNs include complexity bounds in terms of the number of affine pieces [14], expression rate bounds for parametric transport [15], and high-dimensional classification (see [25] and the references there).

Contributions
The main technical contributions of the present paper are improved bounds on the emulation error and neural network complexity (as compared to the first part of this work [21]) of deep ReLU NN emulations of polynomials and continuous, piecewise polynomial functions. The emulations by deep ReLU NNs which are developed in the proofs of the main results are constructive, based on point evaluations and on (piecewise) Chebyšev expansions. Improvements with respect to previous work [21] in terms of stability, DNN size vs. accuracy, and efficient construction are realized.
Specifically, the presently proposed, constructive ReLU NN emulation of a given function $g \in C^0([-1,1])$ uses nodal interpolation in the Clenshaw-Curtis nodes in $[-1,1]$. We denote the Lagrangian polynomial interpolant of $g$ of degree $p \in \mathbb{N}$ by $g_p$ (this corresponds to $\Pi_{cc,p}[g]$ in the notation of Section 2.4). The main difference of the presently proposed construction with respect to [21] is the use of FFT-based methods for Chebyšev interpolation. This allows for fast, numerical construction of the ReLU emulation of a given continuous function, and for favorable stability properties of the construction: the sum of the absolute values of the Chebyšev coefficients of a polynomial is much smaller than that of its Taylor coefficients. Enhanced numerical stability of Chebyšev-polynomial-based DNNs has recently been reported in [33].

Chebyšev expansions of polynomial interpolants
We elaborate on the error and the complexity of interpolation and assume for simplicity that $g$ admits a Chebyšev expansion $g = \sum_{j=0}^{\infty} a_j T_j$ with absolutely summable Chebyšev coefficients $(a_j)_{j=0}^{\infty}$. One sufficient condition for absolute summability is that the first derivative of $g$ is of bounded variation. This is in particular the case for functions in the Sobolev space $W^{2,1}((-1,1))$ and also for all functions realized by ReLU NNs. For such functions, it is shown in [35, Theorem 7.2] that the Chebyšev coefficients $(a_j)_{j\in\mathbb{N}}$ satisfy $|a_j| \leq C V j^{-2}$, where $V$ denotes the variation of the derivative.
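This decay can be observed numerically. The sketch below (illustrative Python, not from the paper; the degree 200 and the use of NumPy's `chebinterpolate` are our choices) computes the Chebyšev coefficients of $f(x) = |x|$, whose derivative has variation $V = 2$, and checks that $j^2 |a_j|$ stays bounded:

```python
import numpy as np
from numpy.polynomial import chebyshev as Cheb

# f(x) = |x| has a derivative of bounded variation (V = 2), so by
# [35, Theorem 7.2] its Chebyshev coefficients satisfy |a_j| <= C V j^{-2}.
# We inspect the coefficients of a high-degree interpolant; for j well
# below the degree they are close to the true series coefficients.
c = Cheb.chebinterpolate(np.abs, 200)

for j in (10, 20, 50):
    print(j, j**2 * abs(c[j]))   # roughly constant (about 4/pi)
```

For $|x|$ the even coefficients are known in closed form, $a_{2k} = \pm 4/(\pi(4k^2-1))$, so $j^2|a_j| \to 4/\pi \approx 1.27$; the odd coefficients vanish by symmetry.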
Focusing in this discussion on error bounds in $L^\infty((-1,1))$ (see Section 1.1.3 for the full range of treated Sobolev norms), the error $\|g - g_p\|_{L^\infty((-1,1))}$ is determined by the smoothness of $g$, which can be expressed in terms of the decay of its Chebyšev coefficients. We denote $g_p = \sum_{j=0}^{p} c_j T_j$, where the coefficients $(c_j)_{j=0}^{p}$ differ from $(a_j)_{j=0}^{p}$ due to aliasing and can be computed from function values in the Clenshaw-Curtis nodes using the inverse fast Fourier transform, as will be discussed in Proposition 2.2 Item (iii). They depend linearly on the function $g$ which is to be approximated and are given explicitly in terms of $(a_j)_{j=0}^{\infty}$ by [35, Theorem 4.2]. The interpolation error can be estimated by
$$ \|g - g_p\|_{L^\infty((-1,1))} \leq \sum_{j=p+1}^{\infty} |a_j| + \sum_{j=0}^{p} |c_j - a_j| \leq 2 \sum_{j=p+1}^{\infty} |a_j| , $$
where the last step used [35, Theorem 4.2]. Thus, if for some $\alpha > 0$ it holds that $|a_j| \leq C j^{-1-\alpha}$ for all $j \in \mathbb{N}$, then $\|g - g_p\|_{L^\infty((-1,1))} \leq C p^{-\alpha}$. As is well known, cf. e.g. [26] and [35, Sections 7 and 8], Sobolev smoothness of the represented function is characterized by the decay of the Chebyšev coefficients. For example, as mentioned before, $\alpha = 1$ for functions whose first derivative is of bounded variation due to [35, Theorem 7.2], which includes functions in $W^{2,1}((-1,1))$ and realizations of ReLU NNs. The moderate growth of the Lebesgue constant of the Clenshaw-Curtis nodes in terms of the polynomial degree $p$ (with respect to the $L^\infty((-1,1))$-norm it is at most $\frac{2}{\pi}\log(p+1) + 1$, see Proposition 2.2 Item (ii)) implies favorable numerical stability of Lagrange interpolation in the Clenshaw-Curtis nodes.
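The quoted bound on the Lebesgue constant can be checked numerically. The following sketch (illustrative code, not from the paper; the grid resolution is an arbitrary choice) estimates the Lebesgue constant of the Clenshaw-Curtis points on a fine grid and compares it with $\frac{2}{\pi}\log(p+1)+1$:

```python
import numpy as np

def lebesgue_constant(p, m=5000):
    """Estimate the Lebesgue constant of Lagrange interpolation in the
    p+1 Clenshaw-Curtis points by maximizing the Lebesgue function
    sum_j |ell_j(x)| over a fine grid (direct evaluation of the basis)."""
    xj = np.cos(np.arange(p + 1) * np.pi / p)   # Clenshaw-Curtis nodes
    x = np.linspace(-1.0, 1.0, m)
    L = np.zeros(m)
    for j in range(p + 1):
        others = np.delete(xj, j)
        ell = np.prod((x[:, None] - others) / (xj[j] - others), axis=1)
        L += np.abs(ell)
    return L.max()

for p in (4, 16, 64):
    print(p, lebesgue_constant(p), 2.0 / np.pi * np.log(p + 1) + 1.0)
```

The grid maximum slightly underestimates the true Lebesgue constant, so it always stays below the proven bound; it also never falls below 1, since the Lagrange basis sums to 1.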
This allows us to build a polynomial $u_p$ that is approximately equal to $g$ by interpolating an approximation $u$ of $g$ whose Chebyšev coefficients are also absolutely summable, e.g. a ReLU NN approximation of $g$. It follows that
$$ \|g - u_p\|_{L^\infty((-1,1))} \leq \|g - g_p\|_{L^\infty((-1,1))} + \left(\tfrac{2}{\pi}\log(p+1) + 1\right) \|g - u\|_{L^\infty((-1,1))} , $$
i.e. in addition to the error $g - g_p$ there is a second term in which the approximation error $g - u$ is magnified by at most a factor $\left(\tfrac{2}{\pi}\log(p+1) + 1\right)$.
The computational complexity of interpolation comprises at most $O(p)$ function evaluations and an additional $O(p(1+\log(p)))$ operations for the execution of the fast Fourier transform.
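The computation of the coefficients $(c_j)_{j=0}^{p}$ from the Clenshaw-Curtis values with a single FFT can be sketched as follows (a minimal Python illustration of the standard Chebfun algorithm, not the implementation referenced in the paper):

```python
import numpy as np

def cheb_coeffs(f, n):
    """Chebyshev coefficients (c_0, ..., c_n) of the degree-n interpolant
    of f in the Clenshaw-Curtis points x_j = cos(j*pi/n), computed via one
    FFT of the evenly extended data (cost O(n log n))."""
    j = np.arange(n + 1)
    x = np.cos(j * np.pi / n)                 # Clenshaw-Curtis nodes
    fx = f(x)
    v = np.concatenate([fx, fx[-2:0:-1]])     # even extension, length 2n
    V = np.real(np.fft.fft(v)) / n
    c = V[:n + 1].copy()
    c[0] /= 2.0
    c[n] /= 2.0
    return c

# The interpolant reproduces polynomials of degree <= n exactly,
# e.g. f = T_2 yields the coefficient vector approximately (0, 0, 1):
print(np.round(cheb_coeffs(lambda x: 2.0 * x**2 - 1.0, 2), 12))
```

By construction, evaluating $\sum_{j} c_j T_j$ in the nodes returns the sampled values, which can be verified with `numpy.polynomial.chebyshev.chebval`.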
If $g$ is the realization of a ReLU NN from Section 3.2, then the Chebyšev coefficients can simply be read off from the output layer network weights, exactly and with complexity $p+1$.
The ReLU NN emulation $\widetilde g_p$ of the chebfun object $g_p$ is constructed in two steps. First, a ReLU DNN emulating the Chebyšev polynomials $(T_j)_{j=1}^{p}$ is constructed; this is done in Lemma 3.9 below. We denote its outputs by $(\widetilde T_j)_{j=1}^{p}$. These ReLU emulations of the $T_j$ incur an error. For all $0 < \delta < 1$, the constructed ReLU NN emulating $(T_j)_{j=1}^{p}$ with $\|T_j - \widetilde T_j\|_{L^\infty((-1,1))} \leq \delta < 1$ for all $j = 1, \ldots, p$ has a network size which we show to be bounded from above by $C(p(1+\log(p)) + p\log(1/\delta))$, with $C > 0$ independent of $p \in \mathbb{N}$ and of $\delta$. The second step is to compute in the output layer the linear combination $\widetilde g_p := \sum_{j=0}^{p} c_j \widetilde T_j$. We have $T_0 \equiv 1$, so we can define $\widetilde T_0 := 1$ and add the constant term $c_0 \widetilde T_0$ as a bias in the output layer, without error. Similarly, the linear polynomial $T_1$ can be emulated exactly, i.e. $\widetilde T_1 := T_1$. The emulation error of $\widetilde g_p$ can therefore be estimated by
$$ \|g_p - \widetilde g_p\|_{L^\infty((-1,1))} \leq \delta \sum_{j=2}^{p} |c_j| . $$
The favorable conditioning of the Chebyšev polynomials will allow us to show in Lemma 2.3 that $\sum_{j=2}^{p} |c_j| \leq p^4 \|g_p\|_{L^\infty((-1,1))}$ (we actually will prove this for all polynomials of degree at most $p$), which implies that for all $0 < \varepsilon_p < 1$ a relative emulation error tolerance $\varepsilon_p \geq \|g_p - \widetilde g_p\|_{L^\infty((-1,1))} / \|g_p\|_{L^\infty((-1,1))}$ can be achieved by setting $\delta = \varepsilon_p/p^4$. Substituting into the bound on the network size gives $C(p(1+\log(p)) + p\log(p^4/\varepsilon_p)) \leq C'(p(1+\log(p)) + p\log(1/\varepsilon_p))$. Choosing $\varepsilon_p$ at most the interpolation accuracy provides a ReLU NN emulation $\widetilde g_p$ of the degree $p$ Chebyšev expansion $g_p$ of $g$ with pointwise accuracy furnished by the Chebfun algorithms. Furthermore, when the Chebyšev coefficients are known, the computational complexity for computing $\widetilde g_p$ equals the size of the resulting NN, which is at most $C(p(1+\log(p)) + p\log(1/\varepsilon_p))$.

Quantitative improvements over previous results
We compare our Chebyšev-based construction with the approach in [21], which was based on a monomial expansion $g_p = \sum_{j=0}^{p} t_j x^j$ and a ReLU DNN emulation of the monomials $(x^j)_{j=1}^{p}$. We denote its outputs by $(\widetilde X_j)_{j=1}^{p}$. For all $0 < \delta < 1$, the ReLU NN emulating $(\widetilde X_j)_{j=1}^{p}$ constructed in [21] with $\|x^j - \widetilde X_j\|_{L^\infty((-1,1))} \leq \delta$ for all $j = 1, \ldots, p$ has a network size which is bounded by $C(p(1+\log(p)) + p\log(1/\delta))$. Again, the second step is to compute in the output layer a linear combination $\widetilde g^M_p := \sum_{j=0}^{p} t_j \widetilde X_j$. The emulation error of $\widetilde g^M_p$ can be estimated by
$$ \|g_p - \widetilde g^M_p\|_{L^\infty((-1,1))} \leq \delta \sum_{j=2}^{p} |t_j| . $$
Bad conditioning of the monomials implies that there does not exist $c > 0$ such that for every integer $p \geq 2$ and all polynomials of degree $p$, their Taylor coefficients can be estimated by $\sum_{j=2}^{p} |t_j| \leq \|g_p\|_{L^\infty((-1,1))} p^c$. On the contrary, upper bounds on $\sum_{j=2}^{p} |t_j| / \|g_p\|_{L^\infty((-1,1))}$ which hold uniformly for all polynomials of degree $p$ generally grow exponentially with $p$. For example, for $p \geq 2$ and $g = T_p$ we have $g_p = T_p$ and $t_p = 2^{p-1}$, which means that $\sum_{j=2}^{p} |t_j| \geq |t_p| = 2^{p-1}$ and thus $\sum_{j=2}^{p} |t_j| \geq 2^{p-1} \|g_p\|_{L^\infty((-1,1))}$, which implies that a relative emulation error $\varepsilon_p := \|g_p - \widetilde g^M_p\|_{L^\infty((-1,1))} / \|g_p\|_{L^\infty((-1,1))}$ requires $\delta \leq \varepsilon_p/2^{p-1}$. Substituting into the bound on the network size gives a term of the order $Cp^2$. In [21], it was shown that there exist constants $c, C > 0$ such that it is sufficient to choose $\delta = \varepsilon_p C c^{-p}$, resulting in the upper bound $C(p^2 + p\log(1/\varepsilon_p))$ on the network size.¹ Compared to [21], we cover here a wider range of Sobolev spaces $W^{s,r}(I)$ for $0 \leq s \leq 1$ and $1 \leq r \leq \infty$ on arbitrary bounded intervals $I = (a,b)$, keeping track of the natural scaling of Sobolev norms as a function of the interval length $b-a$.
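The exponential gap between the two coefficient sums is easy to reproduce (an illustrative sketch, not from the paper; `cheb2poly` converts a Chebyšev coefficient vector to monomial coefficients):

```python
import numpy as np
from numpy.polynomial import chebyshev as Cheb

p = 10
cheb = np.zeros(p + 1)
cheb[p] = 1.0                  # g = T_p: Chebyshev coefficient sum equals 1
mono = Cheb.cheb2poly(cheb)    # Taylor (monomial) coefficients of T_p

print(np.sum(np.abs(cheb)))    # 1.0
print(mono[p])                 # leading coefficient 2^(p-1) = 512
print(np.sum(np.abs(mono)))    # 3363: far larger than the Chebyshev sum
```

Since $\|T_{10}\|_{L^\infty((-1,1))} = 1$, the monomial-based construction must compensate this growth with a $\delta$ that is exponentially small in $p$.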

Chebfun
Advantages of representing a function through its Chebyšev coefficients have also been leveraged in the so-called "Chebfun" software package, cf. http://www.chebfun.org. There, too, functions are represented through Chebyšev coefficients of the interpolant in the Clenshaw-Curtis nodes, in a so-called "chebfun" object (cf. [36, Section 1.1]).² Chebfun has a broad range of functionalities and acts directly on the Chebyšev coefficients.
Our DNN emulation result in Proposition 3.11 allows in particular what could be called "Chebfun - Neural Network" interoperability, via transfer of all functionalities of Chebfun to the presently considered ReLU NN context. Specifically: (i) Given a ReLU NN, a chebfun object can be created through interpolation (with the Chebfun function chebfun). In case the NN was constructed as in Proposition 3.11, no interpolation is necessary: the Chebyšev coefficients can be read off directly from the output layer weights.
(ii) Any Chebfun functionality can be applied to the obtained chebfun object, including those functions whose output is again a chebfun object.
(iii) Our NN emulation results provide ReLU NN emulations of the output chebfun object.
Similarly, ReLU NN functionalities may enhance Chebfun, e.g. when the functions afford only low regularity in classical Hölder or Besov spaces, but exhibit self-similar structure, as is the case e.g. with fractal function systems. We refer to the discussion in [4, Sec. 7.3.1]: for certain fractal function classes with low Sobolev regularity and dense singular support, deep ReLU NNs can still afford exponential approximation rates in terms of the NN size, whereas polynomial approximations furnished by Chebfun will converge only at low, algebraic rates in terms of the polynomial degree.

Layout
The structure of this text is as follows: In Section 2, we recapitulate classical facts from approximation theory and from orthogonal polynomials. Section 2.2 introduces the spline spaces for which we prove emulation bounds. Section 2.3 formalizes certain scales of weighted function spaces on intervals $I$, whose members are piecewise smooth, up to possibly a finite number of point singularities at points in $I$. Section 2.4 recalls polynomial interpolation in the Clenshaw-Curtis points, its stability, and the expression of such interpolants in the basis of Chebyšev polynomials using the inverse fast Fourier transform. All these concepts are used to construct and analyze, in Section 3, ReLU DNN approximations of univariate polynomials and continuous, piecewise polynomial functions. Using these results to emulate h-, p- and hp-finite element methods provides us in Section 4 with DNN approximation rates.

¹ The exponential growth of Taylor coefficients holds more generally than for the example of $T_p = g = g_p$. For the polynomial interpolant $g_p = \sum_{j=0}^{p} c_j T_j$ of any continuous function $g$, we can expand $(T_j)_{j=0}^{p}$ in their monomial expansions to obtain the monomial expansion $g_p = \sum_{j=0}^{p} t_j x^j$. Using again that the $p$'th coefficient of $T_p$ equals $2^{p-1}$ and observing that the only term with the power $x^p$ comes from $c_p T_p$, we see that $t_p = 2^{p-1} c_p$.

² The label "Chebfun" (with capital C) refers to the software package [36], and chebfun, with lowercase c, to an object within this software (representing a piecewise polynomial function through its Chebyšev coefficients).

Notation
We denote by $C > 0$ a generic, positive constant whose numerical value may be different at each appearance, even within an equation. The dependence of $C$ on parameters is made explicit where necessary.
For $p \in \mathbb{N}_0$, we denote the space of polynomials of degree at most $p$ by $\mathbb{P}_p = \operatorname{span}\{x^j : j \in \mathbb{N}_0, j \leq p\}$, with the convention $\mathbb{P}_{-1} := \{0\}$. For an interval $I \subset \mathbb{R}$, we will sometimes write $\mathbb{P}_p(I)$ to indicate that we consider polynomials as functions on $I$. For $k \in \mathbb{N}_0$, the univariate Chebyšev polynomial (of the first kind) normalized such that $T_k(1) = 1$ is denoted by $T_k$. With slight abuse of notation we write expressions of the form $\big(\sum_i (a_i)^r\big)^{1/r}$ for $r \in [1,\infty]$, where $a_i \in \mathbb{R}$ for all $i$, by which we mean $\sup_i a_i$ in case $r = \infty$. Similarly, when writing $\|f\|_{L^r(D)}$ we mean $\operatorname{ess\,sup}_{x \in D} |f(x)|$ in case $r = \infty$.
For vector spaces $U, W$ and a bounded linear operator $P : U \to W$, the operator norm of $P$ is $\|P\|_{U,W}$.
In connection with neural networks, we shall also invoke the realization R, parallelization P and the full parallelization FP in Section 3.

Weighted function spaces on I
We introduce finite order Sobolev spaces and analytic classes of Sobolev spaces on the open, bounded interval $I$, with and without weight. These spaces will be employed to describe the regularity of the functions $u$ to be emulated by ReLU DNNs.

Finite order Sobolev spaces
For a bounded open interval $I \subset \mathbb{R}$ and $u : I \to \mathbb{R}$, for all $k \in \mathbb{N}$ we write $|u|_{W^{k,r}(I)} := \|u^{(k)}\|_{L^r(I)}$ and also, for all $r \in [1,\infty)$ and all $k \in \mathbb{N}_0$, $\|u\|_{W^{k,r}(I)} := \big(\sum_{\ell=0}^{k} |u|_{W^{\ell,r}(I)}^r\big)^{1/r}$, and for $r = 2$ we write $H^k(I) := W^{k,2}(I)$.
Although we are mainly interested in real-valued functions, some of the error estimates are given in Lebesgue spaces of complex-valued functions defined on some open set $E \subset \mathbb{C}$ containing $I$. For such spaces, the codomain will be indicated explicitly, e.g. $L^\infty(E;\mathbb{C})$.

Finite order weighted Sobolev spaces
To allow for functions with point singularities, but otherwise smooth, we next introduce a continuous weight function defined in terms of a finite number of singular points. We will consider spaces of functions which may be singular in given points $A_1, \ldots, A_{N_{\mathrm{sing}}} \in \mathbb{R}$ for $N_{\mathrm{sing}} \in \mathbb{N}$. The $A_i$ will usually (but not always) be assumed to belong to $I$. For a weight sequence $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_{N_{\mathrm{sing}}}) \in \mathbb{R}^{N_{\mathrm{sing}}}$, we consider the weight function
$$ \psi_{\boldsymbol{\beta}}(x) := \prod_{i=1}^{N_{\mathrm{sing}}} |x - A_i|^{\beta_i} . \qquad (2.4) $$
If $N_{\mathrm{sing}} = 1$, we will write $\beta$ instead of $\boldsymbol{\beta}$.
For large weight exponents $\beta_i > 0$ in (2.4), the weight function $\psi_{\boldsymbol{\beta}}$ is small close to the singular points $A_i$, i.e. the function $u$ and its derivatives are allowed to grow strongly close to the singularity. For $r \in [1,\infty)$ and $m, \ell \in \mathbb{N}_0$ with $m \geq \ell$, the space $W^{m,\ell}_{r,\beta}(I)$ is defined as the closure of $C^\infty(I)$ with respect to the weighted Sobolev norm
$$ \|u\|_{W^{m,\ell}_{r,\beta}(I)} := \Big( \|u\|_{W^{\ell-1,r}(I)}^r + \sum_{k=\ell}^{m} \|\psi_{\beta+k-\ell}\, u^{(k)}\|_{L^r(I)}^r \Big)^{1/r} . $$
The corresponding seminorm equals $|u|_{W^{m,\ell}_{r,\beta}(I)} := \|\psi_{\beta+m-\ell}\, u^{(m)}\|_{L^r(I)}$. For $r = \infty$, these norms are understood with respect to the $L^\infty(I)$- and $\|\cdot\|_{\infty}$-norms. In the Hilbertian case $r = 2$, we write $H^{m,\ell}_\beta(I) := W^{m,\ell}_{2,\beta}(I)$.

Finite order fractional Sobolev spaces
In Section 3.3 we will briefly comment on error bounds with respect to norms of fractional Sobolev spaces $W^{s,r}(I)$ for $s \in (0,1)$ and $r \in [1,\infty]$.

Weighted Gevrey classes
Weighted Gevrey classes are sets of functions on $I = (0,1)$ which may be singular in $x = 0$, but are smooth (not necessarily analytic) away from $0$. We refer to [8,3] and the references there for such spaces. They contain as particular cases also weighted analytic functions (which satisfy Equation (2.5) below for $\delta = 1$). We define for $\delta \geq 1$, $\ell \in \mathbb{N}_0$ and $\beta \in (0,1)$ the Gevrey class $G^{\ell,\delta}_\beta(I)$ to be the set of functions $u$ in $\bigcap_{k \geq \ell} H^{k,\ell}_\beta(I)$ for which there exist constants $C_u, d_u > 0$ such that
$$ \|\psi_{\beta+k-\ell}\, u^{(k)}\|_{L^2(I)} \leq C_u\, d_u^{\,k-\ell}\, ((k-\ell)!)^{\delta} \quad \text{for all } k \geq \ell . \qquad (2.5) $$
Rates of their ReLU NN approximations will be studied in Section 4.3.

Polynomial interpolation in Clenshaw-Curtis points
Due to its central role in our NN constructions, we recall basic properties of interpolation in the Clenshaw-Curtis points, including stability bounds and how Chebyšev coefficients of the interpolant can be computed from function values in the Clenshaw-Curtis points using the inverse fast Fourier transform (IFFT). This is done in Proposition 2.2. In the proof of Corollary 3.10, we need estimates on the Chebyšev coefficients of polynomials in $\mathbb{P}_p((-1,1))$ in terms of the $L^\infty((-1,1))$-norm of the polynomial. These are provided in Lemma 2.3.
(ii) The first bound is stated e.g. in [35, Theorem 15.2], where also a brief history of this result is provided. The second estimate follows from the first one using Markov's inequality, Lemma 2.1. (iii) This is shown in [26, Theorem 3.13].
To estimate the neural network approximation error in Section 3, it will be important to have a stability bound on the sum of the absolute values of the coefficients in Item (iii). In Lemma 2.3, we estimate the sum of the absolute values of the Chebyšev coefficients of a polynomial in terms of its $L^\infty((-1,1))$-norm (2.9). Proof. We use [34, Theorem 4.2] with $k = 1$ in the notation of the reference (note that slightly sharper estimates are given in [18, Theorem 2.1] and [35, Theorem 7.1]).
For $p = 1$ both sides of the inequalities vanish. Using Markov's inequality (Lemma 2.1), we obtain a bound in terms of the $L^\infty((-1,1))$-norm of the polynomial. Together with the previous estimate, this shows (2.9).

ReLU NN approximation of univariate functions
In this section we consider NNs with the ReLU activation function $\varrho : \mathbb{R} \to \mathbb{R} : x \mapsto \max\{0,x\}$ for the approximation of univariate functions on bounded intervals. The ReLU NN approximation of continuous, piecewise polynomial functions allows us to transfer existing results on approximation by piecewise polynomial functions to obtain approximation results for ReLU NNs. We discuss several such approximation results in Section 4. Combined with [21, Section 6], these also give efficient approximations to high-dimensional, radially symmetric functions, as developed in [21, Sections 6.1-6.3].

ReLU neural network calculus
As usual (e.g. [24,21]), we define a NN as an $L$-tuple of weight-bias pairs. A NN realizes a function, called the realization of the NN, as a composition of parameter-dependent affine transformations and a nonlinear activation function $\varrho$. We briefly recall the NN formalism from [21, Section 2].
Here $\varrho$ acts componentwise on vector-valued inputs, $\varrho(y) = (\varrho(y_1), \ldots, \varrho(y_m))$ for all $y = (y_1, \ldots, y_m) \in \mathbb{R}^m$. We call $L(\Phi) := L$ the number of layers or depth of $\Phi$, $M_j(\Phi) := \|A_j\|_{\ell^0} + \|b_j\|_{\ell^0}$ the number of nonzero weights and biases in the $j$-th layer, and $M(\Phi) := \sum_{j=1}^{L} M_j(\Phi)$ the number of nonzero weights or size of $\Phi$. The number of nonzero weights and biases in the first layer is also denoted by $M^{\mathrm{fi}}(\Phi)$, and the number of those in the last layer by $M^{\mathrm{la}}(\Phi)$. The first $L-1$ layers, in which the activation function is applied, are called hidden layers.
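As a toy illustration of this formalism (our sketch, not code from the paper), a NN can be stored as a list of weight-bias pairs; the well-known two-neuron ReLU identity $x = \varrho(x) - \varrho(-x)$ then has depth 2 and size 4:

```python
import numpy as np

def relu(y):
    return np.maximum(0.0, y)

def realize(phi, x):
    """Realization of phi = [(A_1, b_1), ..., (A_L, b_L)]: ReLU is applied
    in the L-1 hidden layers, the output layer is affine."""
    for A, b in phi[:-1]:
        x = relu(A @ x + b)
    A, b = phi[-1]
    return A @ x + b

def size(phi):
    """M(phi): the total number of nonzero weights and biases."""
    return sum(int(np.count_nonzero(A) + np.count_nonzero(b)) for A, b in phi)

# Depth-2 ReLU identity network: x = relu(x) - relu(-x).
identity = [(np.array([[1.0], [-1.0]]), np.zeros(2)),
            (np.array([[1.0, -1.0]]), np.zeros(1))]
print(realize(identity, np.array([-0.3])), size(identity))
```

Here `len(phi)` plays the role of the depth $L(\Phi)$ and `size` counts only nonzero entries, matching $M(\Phi)$ above.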
We construct ReLU DNN emulations of Finite Element spaces (2.3) from certain fundamental building blocks using a calculus of ReLU NNs from [24].
The following two propositions introduce parallelizations in which, given two networks of equal depth, one network is constructed which in parallel emulates the realizations of the two given networks.In the first proposition, both subnetworks have the same input.In the second proposition, they have different inputs.
Both propositions contain bounds on the number of nonzero weights in the first and the last layer that can be derived from the definitions. Given two NNs $\Phi^1, \Phi^2$ with $d$-dimensional input and depth $L$, there exists a network $P(\Phi^1, \Phi^2)$ with $d$-dimensional input and $L$ layers, called the parallelization of $\Phi^1$ and $\Phi^2$, satisfying $R(P(\Phi^1, \Phi^2))(x) = (R(\Phi^1)(x), R(\Phi^2)(x))$ for all $x \in \mathbb{R}^d$. We will also use parallelizations of NNs that do not have the same inputs: for NNs $\Phi^1, \Phi^2$ with input dimensions $d_1, d_2$ and equal depth $L$, there exists a ReLU NN, denoted by $FP(\Phi^1, \Phi^2)$, with $d = d_1 + d_2$-dimensional input and depth $L$, called the full parallelization of $\Phi^1$ and $\Phi^2$, such that $R(FP(\Phi^1, \Phi^2))(x_1, x_2) = (R(\Phi^1)(x_1), R(\Phi^2)(x_2))$ for all $(x_1, x_2) \in \mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$. We next recall the concatenation of two NNs. The matrix multiplications in the definition below may lead to an undesirably large network size, if the product of two small or sparse matrices is a large dense matrix (this applies to $U_1 V_{L_2}$ in the definition below). For this reason, so-called sparse concatenations will be defined in Proposition 3.5. Definition 3.4 (Concatenation, [24, Definition 2.2]). Let $\Phi^1, \Phi^2$ be two NNs such that the input dimension of $\Phi^1$ equals the output dimension of $\Phi^2$, which we denote by $k$. For $\ell = 1, \ldots, L_1 := L(\Phi^1)$ we denote the weight matrices and bias vectors of $\Phi^1$ by $\{U_\ell\}_\ell$ and $\{a_\ell\}_\ell$, and for $\ell = 1, \ldots, L_2 := L(\Phi^2)$ we denote those of $\Phi^2$ by $\{V_\ell\}_\ell$ and $\{b_\ell\}_\ell$. Then we define the concatenation of $\Phi^1$ and $\Phi^2$ as the NN with layers $(V_1, b_1), \ldots, (V_{L_2-1}, b_{L_2-1}), (U_1 V_{L_2}, U_1 b_{L_2} + a_1), (U_2, a_2), \ldots, (U_{L_1}, a_{L_1})$. For the sparse concatenation, let $\Phi^1, \Phi^2$ be two NNs such that the input layer of $\Phi^1$ has the same dimension as the output layer of $\Phi^2$.
Then, there exists a ReLU NN $\Phi^1 \odot \Phi^2$, called the sparse concatenation of $\Phi^1$ and $\Phi^2$, such that $R(\Phi^1 \odot \Phi^2) = R(\Phi^1) \circ R(\Phi^2)$. The following proposition summarizes the properties of one type of identity NNs based on the ReLU activation function.

ReLU emulation of polynomials
In this section, we construct ReLU NNs whose realizations emulate univariate polynomials of arbitrary degrees on a bounded interval. We closely follow the approach in [21, Section 4], again exhibiting explicitly the dependence of the DNN depth and size on the polynomial degree. Differently from the construction in [21], for $n \in \mathbb{N}$ we use Chebyšev polynomials of the first kind, denoted by $\{T_\ell\}_{\ell \leq n}$, as basis for the space $\mathbb{P}_n$ of polynomials of degree $n$. As was derived for NNs with the RePU activation function $\varrho_r$ for $r \in \mathbb{N}$ satisfying $r \geq 2$ in [33], and as we will see for ReLU NNs in Lemma 3.9 below, they can be approximated efficiently by exploiting the multiplicative three-term recursion
$$ T_{m+n}(x) = 2\, T_m(x)\, T_n(x) - T_{|m-n|}(x) , \qquad (3.3) $$
which is distinct from the additive three-term recurrence relation, is not available for other families of orthogonal polynomials on $[-1,1]$, and is related to the addition rule for cosines. The advantage of using a Chebyšev basis instead of a monomial basis is that Chebyšev approximation is better conditioned. For example, for functions whose derivative is of bounded variation, the sum of the absolute values of the Chebyšev coefficients is bounded in terms of that variation ([35, Theorem 7.2]). This includes functions in $W^{2,1}((-1,1))$, for which the variation of the derivative is bounded by the $W^{2,1}((-1,1))$-norm, and realizations of ReLU NNs. The sum of the absolute values of the Chebyšev coefficients is often much smaller than the sum of the absolute values of the coefficients with respect to the monomial basis. The use of the better conditioned Chebyšev basis allows us to obtain bounds on the network depth and size with a better dependence on the polynomial degree than that in [21], because the analysis there used monomial expansions of Legendre polynomials, the coefficients of which in absolute value grow exponentially with the polynomial degree, cf. [21, Equation (4.13)] and the analysis preceding that equation.
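The multiplicative recursion can be verified numerically via $T_k(\cos t) = \cos(kt)$ (a quick sanity check, not part of the paper's constructions):

```python
import numpy as np

def T(k, x):
    # Chebyshev polynomial of the first kind, via T_k(cos t) = cos(k t)
    return np.cos(k * np.arccos(x))

x = np.linspace(-1.0, 1.0, 1001)
for m, n in [(3, 3), (4, 3), (7, 2)]:
    # multiplicative three-term recursion; equivalently,
    # 2 cos(a) cos(b) = cos(a + b) + cos(a - b)
    assert np.allclose(2.0 * T(m, x) * T(n, x), T(m + n, x) + T(abs(m - n), x))
print("recursion verified")
```

For $|m-n| \in \{0,1\}$, the subtracted term is $T_0 \equiv 1$ or $T_1(x) = x$, which is why a single product network per node of the binary tree suffices in the constructions below.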
The base result of this section is Proposition 3.8 on the emulation of polynomials on the reference interval $\hat I := [-1,1]$, which is exact in the endpoints of the interval. For the ReLU DNN approximation of univariate piecewise polynomial functions in Section 3.3, we will use NN approximations of polynomials on an arbitrary bounded interval, such that the NN realizations are continuous and constant outside the interval of interest. Such approximations are presented in Corollary 3.10. In both those results, the hidden layer weights and biases are independent of the approximated polynomials. The output layer weights and biases are the Chebyšev coefficients of the polynomial of interest. They can be computed easily from the values of the polynomial in the Clenshaw-Curtis points using the inverse fast Fourier transform, by Proposition 2.2 Item (iii).
The main building block for our approximations of piecewise polynomials by ReLU NNs is the efficient approximation of products introduced in [37]. We here recall a version with $W^{1,\infty}$-error bounds from [28], in the notation of [21], and in addition recall in Equation (3.5) below from [20] that multiplication by $\pm 1$ is emulated exactly.
For all $\kappa > 0$ and $\delta \in (0, 1/2)$ there exist ReLU product networks $\times_{\delta,\kappa}$ approximating the multiplication of two inputs from $[-\kappa,\kappa]$ with accuracy $\delta$; in addition, for every $a \in \{-1,1\}$, multiplication by $a$ is emulated exactly, see Equation (3.5). We refer to [20, Proposition 6.2.1] for details of the proof, which is based on results proved in [37,28]. The networks from [37, Propositions 2 and 3] and [28, Proposition 3.1] have a fixed width and variable depth. There exist networks with the same realization, whose width is variable independently of their depth, which are studied in [24] and a series of works following [29].
We now prove an analogue of [21, Proposition 4.2] on the NN approximation of polynomials on the reference interval $\hat I = [-1,1]$, based on the better conditioned Chebyšev expansion of the target polynomial. It also covers the straightforward extension to the approximation of $N \in \mathbb{N}$ polynomials by computing them as linear combinations of approximate Chebyšev polynomials, which are emulated only once. Furthermore, the only NN coefficients which depend on the approximated functions are the Chebyšev coefficients, which can be computed efficiently from point values in the Clenshaw-Curtis nodes, as noted in Proposition 2.2 Item (iii).
The next result states ReLU DNN expression rate bounds for $N_v$ polynomials $(v_i)_{i=1}^{N_v}$ of degree $n$, given in terms of their Chebyšev expansions. The weights and biases in the hidden layers are independent of $(v_i)_{i=1}^{N_v}$. The weights and biases in the output layer are the Chebyšev coefficients $\{v_{i,\ell} : \ell = 0, \ldots, n \text{ and } i = 1, \ldots, N_v\}$, which are linear combinations of function values in the Clenshaw-Curtis points, as shown in Proposition 2.2 Item (iii).
The proof of Proposition 3.8 is based on Lemma 3.9 below. It proceeds along the lines of, and uses ideas from, the proof of [21, Proposition 4.2]. As in [21, Lemma 4.5], we compute the basis polynomials (in this case Chebyšev polynomials) using a binary tree of product networks introduced in Proposition 3.7. The main difference with [21] is that we use the recurrence relation (3.3). We use it for $|m-n| \in \{0,1\}$, such that only one product network is required to compute $T_{m+n}(x)$ from $T_m(x)$, $T_n(x)$ and $T_{|m-n|}(x) \in \{1, x\}$.
The networks constructed in the proofs of Lemma 3.9 and Proposition 3.8 were already defined in [12, Appendix A]; there, however, their depth, size and error were not analyzed.
For the depth and the size of $\Psi^1_\delta$ we obtain the stated bounds. Note that only the bound on $M^{\mathrm{la}}(\Psi^1_\delta)$ gives the term $C^{\mathrm{la}}$ in Equation (3.11). As we will see below, there is no term $C^{\mathrm{la}}$ in the bound on $M^{\mathrm{la}}(\Psi^k_\delta)$ for $k > 1$. With Proposition 3.7 the error can be bounded as stated, where $[\mathrm{D} \times_{\delta/4,1}]$ is the Jacobian, which is a $1 \times 2$ matrix and is treated as a row vector. The first inequality, Poincaré's inequality and $T_{2,\delta}(0) = -1 = T_2(0)$ (which follows from Equation (3.4)) imply the second inequality. This shows Equation (3.6) for $k = 1$ and finishes the proof of the induction basis.
Exactness in the points ±1 follows from Equation (3.7) in Lemma 3.9.
The depth and the size of $\Phi^v_\tau$ can be bounded using that $2^k \leq 2n$. With the error bounds from Lemma 3.9 we obtain the stated estimates for all $i \in \{1, \ldots, N_v\}$. Finally, the fact that the hidden layer weights and biases of $\Phi^v_\tau$ are independent of $(v_i)_{i=1}^{N_v}$ can be seen directly from its definition. For each $i = 1, \ldots, N_v$, the Lagrange interpolant of $v_i$ of degree $n$ in the Clenshaw-Curtis points $(x^{cc,n}_j)_{j=0}^{n}$ equals $v_i$ by Proposition 2.2 Item (i), which means that Proposition 2.2 Item (iii) can be used to compute the Chebyšev coefficients of $v_i$.
Preparing for the approximation of univariate, piecewise polynomial functions, we next derive as a corollary from Proposition 3.8 a NN approximation of multiple polynomials on a bounded interval $I := [a,b]$ for arbitrary $-\infty < a < b < \infty$, such that on $(-\infty, a)$ the NN realization exactly equals the values of the polynomials in the left endpoint $a$, and on $(b, \infty)$ equals the values of the polynomials in the right endpoint $b$. Also, we use the material from Section 2, in particular Lemma 2.3 bounding Chebyšev coefficients and the inverse inequalities from Section 2.1, to transform the right-hand side of the error bound in Proposition 3.8, which is stated in terms of Chebyšev coefficients, into a bound in terms of Sobolev norms of the approximated polynomial, taking into account the natural scaling of such norms in terms of the interval length $h$.
In the statement of the corollary, the minimum over $v$ in Equation (3.19) expresses the fact that the error is not affected by adding a polynomial of degree 1 to the polynomials we want to approximate, because NNs with ReLU activation can emulate continuous, piecewise linear functions exactly. The corollary provides networks $\{\Phi^{v,I}_\varepsilon\}_{\varepsilon \in (0,1)}$ with input dimension one and output dimension $N_v$, which satisfy error bounds for all $i = 1, \ldots, N_v$ and $1 \leq r, r' \leq \infty$, scaled by powers of $2/h$. The weights and biases in the hidden layers are independent of $(v_i)_{i=1}^{N_v}$. The weights and biases in the output layer are the Chebyšev coefficients, which are linear combinations of the function values in the Clenshaw-Curtis points in $I$.
Proof. The proof consists of two steps. In Step 1, we introduce a piecewise linear map which enables us to use Proposition 3.8 on the reference interval $[-1,1]$. We define it such that Equations (3.17) and (3.18) will be satisfied. In Step 2, we define the network $\Phi^{v,I}_\varepsilon$ and estimate its error, depth and size.
Step 1. Let $P$ be the continuous, piecewise linear function which is affine on $[a,b]$ and satisfies $P(x) = -1$ for all $x \in (-\infty, a]$ and $P(x) = 1$ for all $x \in [b, \infty)$. We denote the inverse of its restriction to $[a,b]$ by $P^{-1}$. With the inverse inequality (2.2) it follows that for all $1 \leq r' < \infty$ and $v \in \mathbb{P}_n$, with $\hat v := v \circ P$, the corresponding bound holds for an absolute constant $C_0$ independent of $n$, $r'$, $I$ and $v$. We used that the function $(x+1)^{1/x} = \exp(\frac{1}{x}\log(1+x))$ equals $2$ at $x = 1$ and converges to $1$ for $x \to \infty$, hence by continuity it is bounded on $[1,\infty)$. Step 2. In this step, we define the network $\Phi^{v,I}_\varepsilon$.
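The map $P$ of Step 1 can be written with two ReLUs, so it is exactly the realization of a small ReLU NN; a minimal sketch (illustrative code, not the paper's construction):

```python
import numpy as np

def relu(t):
    return np.maximum(0.0, t)

def P(x, a, b):
    """Continuous, piecewise linear map: affine from [a, b] onto [-1, 1],
    equal to -1 on (-inf, a] and to 1 on [b, inf). Written via two ReLUs,
    it is the realization of a depth-2 ReLU NN."""
    return -1.0 + (2.0 / (b - a)) * (relu(x - a) - relu(x - b))

a, b = 2.0, 5.0
print(P(np.array([0.0, 2.0, 3.5, 5.0, 10.0]), a, b))  # approx. [-1,-1,0,1,1]
```

On $[a,b]$ only the first ReLU is active, giving the affine map $x \mapsto -1 + 2(x-a)/(b-a)$; for $x > b$ the second ReLU cancels the growth of the first, so the output is constant.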
Let $\Phi^P$ be a NN of depth 2 and size at most 7 which exactly emulates $P$. Let $v = (v_i)_{i=1}^{N_v}$ be as defined in Step 1, and let $\Phi^v_\tau$ for $\tau = \varepsilon/(4 C_0 n^6)$ be the network from Proposition 3.8. Then, we define $\Phi^{v,I}_\varepsilon$ as the sparse concatenation of $\Phi^v_\tau$ and $\Phi^P$. Equations (3.17) and (3.18) follow from the definition of $P$ and exactness in $\pm 1$ of the network from Proposition 3.8. By the result of Step 1 of this proof, for $t = 0, 1$ the error is bounded accordingly. Using exactness of $\Phi^{v,I}_\varepsilon$ in the endpoints of the interval, Hölder's inequality implies the first inequality in Equation (3.19) for all $i = 1, \ldots, N_v$, $1 \leq r < \infty$ and $t = 0, 1$. For all $i = 1, \ldots, N_v$, taking the minimum over $v$ gives the second inequality in Equation (3.19).
It remains to estimate the network depth and size. We use Proposition 3.8 with C_n ≤ (n + 1) N_v and τ = ε/(4 C_0 n^6), which gives the stated bounds. The weights of Φ^P are independent of (v_i)_{i=1}^{N_v}. The remaining statements about the NN weights follow directly from Proposition 3.8.
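The clipped affine map P from Step 1 admits an exact depth-2 ReLU realization with very few parameters. A minimal sketch (the identity P(x) = −1 + (2/(b − a))(ReLU(x − a) − ReLU(x − b)) is elementary to verify; the function names are ours):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def P(x, a, b):
    # exact depth-2 ReLU realization of the clipped affine map:
    # P = -1 on (-inf, a], affine on [a, b], P = 1 on [b, inf)
    return -1.0 + (2.0 / (b - a)) * (relu(x - a) - relu(x - b))
```

For x ≤ a both ReLU terms vanish and P(x) = −1; for x ≥ b the two terms differ by b − a, giving P(x) = 1.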

ReLU emulation of piecewise polynomial functions
Using Corollary 3.10, the dependence of the network size on the polynomial degree in [21, Proposition 5.1] is improved in Proposition 3.11 below.
Based on Proposition 3.11, improvements of some results in [21, Section 5] follow directly. We present them in Section 4 without proof, because the proofs are completely analogous to those in [21]. Other results from [21, Section 5] do not improve, because for those results p ∝ (1 + log(1/ε)), which means that p^2 + p log(1/ε) and p log(p) + p log(1/ε) are of the same order.
Below, we derive a result on the ReLU DNN approximation of continuous, piecewise polynomial functions on an arbitrary bounded interval I, analogous to [21, Proposition 5.1]. For continuous, piecewise polynomial functions we use the notation from Section 2.2. As for the ReLU emulation of polynomials in the previous section, weights and biases in the hidden layers are independent of the approximated function. For weights and biases in the output layer, explicit numerical expressions based on the inverse fast Fourier transform are provided in Remark 3.12. A quick extension of the error bounds to fractional order Sobolev spaces is given in Corollary 3.13.

Proposition 3.11. There exist NNs {Φ^{v,T,p}_ε}_{ε∈(0,1)} such that for all 1 ≤ r, r′ ≤ ∞ the stated error bounds hold for all i = 1, …, N and t = 0, 1.
The first error bound in the proposition follows from Equation (3.21). For the second error estimate, we use that |v̄|_{W^{1,r}(I_i)} ≤ |v|_{W^{1,r}(I_i)}, which we next show using Jensen's inequality in the second-to-last step. Using this fact, we obtain the stated bound, and we conclude by Poincaré's inequality.
The network depth and size can be estimated in terms of p_max := max_{i=1,…,N} p_i.
The realization is exact in the points (x_j)_{j=0}^{N} because Φ^{v̄} emulates v̄ exactly and because R(Φ^{v−v̄,I_i}_{ε_i}) vanishes in the nodes, for all i = 1, …, N. Given T, the hidden layer weights and biases are independent of v, because those of Φ^{v̄} are independent of v (see the proof of [21, Lemma 3.1]) and those of Φ^{v−v̄,I_i}_{ε_i} are independent of v − v̄, as stated in Corollary 3.10. The weights and biases in the output layer can be computed as linear combinations of the function values of v in the Clenshaw-Curtis grids on the subintervals of T. Explicit formulas are given in Remark 3.12 below.
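For concreteness, the Clenshaw-Curtis grid with p_i + 1 points on a subinterval [a, b] consists of the affine images of the Chebyshev extreme points cos(πj/p_i). A minimal sketch (the function name and the ordering from b down to a are our choice):

```python
import numpy as np

def clenshaw_curtis(a, b, p):
    # p + 1 Clenshaw-Curtis points on [a, b], p >= 1:
    # affine images of cos(pi * j / p), j = 0, ..., p,
    # listed from b down to a
    theta = np.pi * np.arange(p + 1) / p
    return (a + b) / 2 + (b - a) / 2 * np.cos(theta)
```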
Remark 3.12. The weights of Φ^{v,T,p}_ε in the output layer can be computed easily from the function values of v in the Clenshaw-Curtis grids on the subintervals of the partition T using the inverse fast Fourier transform. In this remark we give explicit formulas.
The weights in the final layer of Φ^{v̄} are given in terms of the differences v(x_1) − v(x_0), …; the bias equals v(x_0). For i = 1, …, N, the weights and biases in the final layer of Φ^{v−v̄,I_i}_{ε_i} are given in terms of (v − v̄)(x^{cc,p_i}_{I_i,j}), which can be expressed in terms of function values of v. Because v̄|_{I_i} ∈ P_1(I_i), the Chebyšev coefficients (w_i)_ℓ of (v − v̄)|_{I_i} of degree ℓ > 1 equal those of v. Therefore, for ℓ > 1, v − v̄ can be replaced by v in (3.22a).
For ℓ = 0, 1, (w_i)_ℓ can be determined from ((w_i)_ℓ)_{ℓ>1} by writing the conditions in terms of the Chebyšev coefficients, using that T_ℓ(±1) = (±1)^ℓ for all ℓ ∈ N_0.

With the theory of interpolation spaces, we directly obtain the following corollary of Proposition 3.11 when the error is measured in fractional Sobolev norms.
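On the reference interval, the Chebyšev coefficients of the degree-p interpolant in the Clenshaw-Curtis points x_j = cos(πj/p) can be computed with one FFT of the even (mirror) extension of the data, via the standard DCT-I identity. A minimal sketch (the function name is ours; this is the textbook algorithm, not a verbatim transcription of (3.22a)):

```python
import numpy as np

def chebyshev_coeffs(values):
    # Chebyshev coefficients a_0, ..., a_p of the polynomial
    # interpolating `values` at x_j = cos(pi*j/p), j = 0, ..., p (p >= 1),
    # computed via an FFT of the even (mirror) extension (DCT-I)
    values = np.asarray(values, float)
    p = len(values) - 1
    ext = np.concatenate([values, values[-2:0:-1]])  # length 2p
    c = np.real(np.fft.fft(ext)) / p
    coeffs = c[:p + 1].copy()
    coeffs[0] /= 2.0   # endpoint terms are halved in DCT-I
    coeffs[-1] /= 2.0
    return coeffs
```

For example, the values [1, 0, −1, 0, 1] of T_2 at the five Clenshaw-Curtis points with p = 4 yield the coefficient vector [0, 0, 1, 0, 0].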

ReLU emulation of univariate finite element spaces
Based on Proposition 3.11, improved dependence on the polynomial degree of the NN size and depth in several emulation rate bounds for univariate h-, p- and hp-Finite Element spaces from [21] follows directly. We state these results in Sections 4.1, 4.2 and 4.3; the former two are stated without detailed proofs, which are completely analogous to those of the corresponding results in [21], with improved dependence of the NN size on the polynomial degree due to the emulation of Chebyšev polynomials.

Free-knot splines
The following analogue of [21, Theorem 5.4] obtains, for fixed parameters q, q′, t, s, s′, p, and up to a factor log(N) in the network size, the classical adaptive "h-FEM" rate N^{−(s−s′)}, with a network depth of the order log(N). Similar results, which also apply to multivariate functions but are restricted to error bounds in L^q, q ∈ (0, ∞], were obtained in [32, Proposition 1 and Theorem 1]. We first state a spline approximation result. Combined with Proposition 3.11, we obtain ReLU NN approximations in Corollary 4.2. The proof of the corollary is a straightforward application of the two propositions, analogous to the proof of [21, Theorem 5.4].

Proposition 4.1 ([23, Theorems 3 and 6]). Let I = (0, 1), q, q′, t, t′, s, s′ ∈ (0, ∞], p ∈ N, and q < q′, s < p + 1/q, s′ < s − 1/q + 1/q′. Then, there exists a C_3 := C(q, q′, t, t′, s, s′, p) > 0 and, for every N ∈ N and every f in B^s_{q,t}(I), for p = (p, …, p) ∈ N^N, there exist a partition T of I with N elements and φ ∈ S_p(I, T) such that the stated approximation bound holds; moreover, the stated additional bound holds. In Corollary 4.2, there then exists a constant C_*(q, q′, t, s, s′, p) > 0 and, for every N ∈ N and every f ∈ B^s_{q,t}(I), a NN Φ^N_f satisfying the stated error and size bounds. The improvement with respect to [21, Theorem 5.4] is a better dependence of the network depth and size on the polynomial degree p, namely C(1 + log(p))^3.

Spectral methods
Corollary 4.4 provides ReLU NN emulation rate bounds for spectral and so-called "p-version Finite Element" methods. The ReLU NN constructions based on Chebyšev polynomials imply improved dependence of the NN size on the polynomial degree and superior numerical stability, in comparison to the construction in [21, Theorem 5.8]. In particular, the ReLU NN size is now proportional to the number of degrees of freedom Np + 1, up to a polylogarithmic factor. In [32, Proposition 1 and Theorem 1] and [11, Corollary 4.2], similar statements with the same approximation rates were obtained, but based on a different argument. There, the approximation was based on the NN emulation of B-splines and of averaged Taylor polynomials, respectively, of fixed polynomial degree, and on partitions that are refined to reduce the error. The result below is based on the approximation of piecewise polynomials for which both the polynomial degree and the mesh can be refined to reduce the error. We observe that [32, Proposition 1 and Theorem 1] and [11, Corollary 4.2] apply more generally, in particular also to multivariate functions.
First, we state the finite element result on which Corollary 4.4 is based. In the stated bound, h = max_{i∈{1,…,N}} h_i is the maximum element size, defined in Section 2.2. In addition, |v|_{H^1(I)} ≤ |u|_{H^1(I)}.
Then, there is an absolute constant C > 0 and a constant c(s) > 0 (depending only on s) such that, for all p, N ∈ N, there exists a NN Φ^{u,N,p} such that for all s̄ ∈ N_0 satisfying s̄ ≤ min{s, p} the stated bound on u − R(Φ^{u,N,p}) holds, with M_fi(Φ^{u,N,p}) ≤ 2N + 2 and M_la(Φ^{u,N,p}) ≤ N(2p + 3) + 1. This result follows from Proposition 4.3 if we choose T to be the uniform partition of I into N intervals (which is the optimal choice of T). The proof is analogous to that of [21, Theorem 5.8]. We improve upon that result by achieving the same error with a smaller network: up to logarithmic factors, the NN size is O(Np), for N → ∞ and/or p → ∞, rather than O(Np^2) in [21, Theorem 5.8].

Approximation of weighted Gevrey regular functions
We consider approximation rates of ReLU NNs via so-called "hp-approximations" of functions on I = (0, 1) which may have a singularity at x = 0 and which belong to a weighted Gevrey class G^{2,δ}_β(I) defined in (2.5). The following theorem is a generalization of [27, Theorem 3.36], stated there for the analytic case δ = 1, to Gevrey regular functions with δ ≥ 1. It is based on the geometric partition T_{σ,N} of I = (0, 1) introduced in Section 2.2. This hp-approximation result implies, via the ReLU emulation of Chebyšev polynomials, the following DNN expression rate bound.

Proof. The proof is analogous to that of [21, Theorem 5.12]: an hp Finite Element approximation is emulated by DNNs, and the triangle inequality implies ReLU NN expression rate bounds.
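The geometric partition T_{σ,N} of (0, 1), graded toward the singular point x = 0, can be sketched as follows (the node convention x_0 = 0, x_i = σ^{N−i} is our assumption, consistent with the usual definition; Section 2.2 is authoritative):

```python
def geometric_mesh(sigma, N):
    # nodes of a geometric partition of (0, 1) graded toward 0:
    # x_0 = 0 and x_i = sigma**(N - i) for i = 1, ..., N,
    # with grading factor sigma in (0, 1)
    return [0.0] + [sigma ** (N - i) for i in range(1, N + 1)]
```

Element sizes shrink geometrically toward 0, which, combined with elementwise polynomial degrees, yields the exponential hp-convergence exploited in the theorem.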

Conclusions
The present paper extended and improved emulation rate bounds from [21] for deep ReLU NNs for various spaces of continuous, piecewise polynomial functions. Compared to [21], we covered a wider range of Sobolev spaces W^{s,r}(I) for 0 ≤ s ≤ 1 and 1 ≤ r ≤ ∞ on arbitrary bounded intervals I = (a, b), keeping track of the natural scaling of Sobolev norms as a function of the interval length b − a.
While the present DNN emulation bounds addressed only deep ReLU NNs, we emphasize that the results generalize also to other activations. In particular, due to the equivalence of ReLU NNs and so-called spiking networks, which are prominent in neuromorphic computing (see [31] and [30, Thm. 3] and the references there), all presently proved NN emulation rate bounds imply corresponding bounds for such networks.

The realization satisfies R(Φ^{v,T,p}_ε)(x_j) = v(x_j) for all j ∈ {0, …, N}. The weights and biases in the hidden layers are independent of v. The weights and biases in the output layer are linear combinations of the function values in the Clenshaw-Curtis points in I_i for i = 1, …, N.

Proof. As in the proof of [21, Proposition 5.1], we denote by v̄ ∈ S_1(I, T) the continuous, piecewise linear interpolant of v, which is exact in the nodes of the partition, i.e. v̄(x_i) = v(x_i) for all i = 0, …, N. It can be emulated exactly by a NN of depth 2 satisfying M(Φ^{v̄}) ≤ 3N + 1, M_fi(Φ^{v̄}) ≤ 2N and M_la(Φ^{v̄}) ≤ N + 1, see e.g. [21, Lemma 3.1]. On each interval I_i, i = 1, …, N, we approximate the difference v − v̄ ∈ S_p(I, T) with Φ^{v−v̄,I_i}_{ε_i} from Corollary 3.10 for ε_i = ε/4, with N_v = 1 in the notation of the corollary. It holds for all 1
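The depth-2 emulation of the continuous, piecewise linear interpolant v̄ quoted above from [21, Lemma 3.1] can be sketched as follows (a hedged reconstruction; the function names and the returned closure are ours):

```python
import numpy as np

def pwlinear_relu(xs, vs):
    # depth-2 ReLU network realizing the continuous, piecewise linear
    # interpolant of the data (xs[j], vs[j]) on [xs[0], xs[-1]]:
    #   v_bar(x) = vs[0] + sum_i c_i * ReLU(x - xs[i-1]),
    # where c_1 is the first slope and c_i (i > 1) are the slope jumps
    xs, vs = np.asarray(xs, float), np.asarray(vs, float)
    slopes = np.diff(vs) / np.diff(xs)
    c = np.concatenate([[slopes[0]], np.diff(slopes)])
    def v_bar(x):
        return vs[0] + sum(ci * np.maximum(x - xi, 0.0)
                           for ci, xi in zip(c, xs[:-1]))
    return v_bar
```

The hidden layer has one ReLU unit per node except the right endpoint, i.e. N units for a partition with N elements, consistent with the quoted size bounds.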