A comparison of deep networks with ReLU activation function and linear spline-type methods

Deep neural networks (DNNs) generate much richer function spaces than shallow networks. Since the function spaces induced by shallow networks have several approximation-theoretic drawbacks, this, however, does not necessarily explain the success of deep networks. In this article we take another route and compare the expressive power of DNNs with ReLU activation function to piecewise linear spline methods. We show that MARS (multivariate adaptive regression splines) is improperly learnable by DNNs in the sense that for any given function that can be expressed as a function in MARS with $M$ parameters there exists a multilayer neural network with $O(M \log (M/\varepsilon))$ parameters that approximates this function up to sup-norm error $\varepsilon$. We show a similar result for expansions with respect to the Faber-Schauder system. Based on this, we derive risk comparison inequalities that bound the statistical risk of fitting a neural network by the statistical risk of spline-based methods. This shows that deep networks perform better than, or only slightly worse than, the considered spline methods. We provide a constructive proof for the function approximations.


Introduction
Training of DNNs is now state of the art for many different complex tasks including detection of objects on images, speech recognition and game playing. For these modern applications, the ReLU activation function is standard. One of the distinctive features of a multilayer neural network with ReLU activation function (or ReLU network) is that the output is always a piecewise linear function of the input. But there are also other well-known nonparametric estimation techniques that are based on function classes built from piecewise linear functions. This raises the question in which sense these methods are comparable.
We consider methods that consist of a function space $\mathcal F$ and an algorithm/learner/estimator that selects a function in $\mathcal F$ given a dataset. In the case of ReLU networks, $\mathcal F$ is given by all network functions with a fixed network architecture, and the chosen algorithm is typically a version of stochastic gradient descent. A method can perform badly if the function class $\mathcal F$ is not rich enough or if the algorithm selects a bad element in the class. The richness of the class of candidate functions $\mathcal F$ is often referred to as the expressiveness of a method.
Deep networks have a much higher expressiveness than shallow networks, cf. [13,5,18,21,12,16]. Shallow ReLU networks have, however, several known drawbacks. Many nodes/units are required, for instance, to localize and to approximate a function supported on a small hypercube, cf. [3]. A large number of parameters is moreover necessary to approximately multiply inputs. While the literature has mainly focused on the comparison of deep versus shallow networks, it is necessary to compare deep networks also with methods that have a similar structure but do not suffer from the shortcomings of shallow networks.
There is a vast amount of literature on spline based approximation, cf. [1], [2], [4], [14]. For a direct comparison with ReLU networks it is natural to focus on spaces spanned by a piecewise linear function system. In this article we compare deep ReLU networks to well-known spline methods based on classes of piecewise linear functions. More specifically, we consider multivariate adaptive regression splines (MARS) [7] and series expansion with respect to the Faber-Schauder system [6,19].
MARS is a widely applied method in statistical learning [11]. Formally, MARS refers to the algorithm describing the selection of a function from a specific class of piecewise linear candidate functions, called the MARS function class. Similarly to multilayer neural networks, the MARS function class is built from a system of simple functions and has width- and depth-like parameters. In contrast to DNNs, MARS also uses tensorization to generate multivariate functions. The Faber-Schauder functions are the integrals of the Haar wavelet functions and are hence piecewise linear. Series expansion with respect to the Faber-Schauder system therefore results in a piecewise linear approximation. Both the MARS and the Faber-Schauder class satisfy a universal approximation theorem, see Section 2.2 for a discussion.
We show that for any given function that can be expressed as a function in the MARS function class or the Faber-Schauder class with $M$ parameters, there exists a multilayer neural network with $O(M \log(M/\varepsilon))$ parameters that approximates this function up to sup-norm error $\varepsilon$. We also show that the opposite is false. There are functions that can be approximated up to error $\varepsilon$ with $O(\log(1/\varepsilon))$ network parameters but require at least of the order $\varepsilon^{-1/2}$ many parameters if approximated by MARS or the Faber-Schauder class.
The result provides a simple route to establish approximation properties of deep networks. For the Faber-Schauder class, approximation rates can often be deduced in a straightforward way and these rates then carry over immediately to approximations by deep networks. As another application, we prove that for nonparametric regression, fitting a DNN will never have a much larger risk than MARS or series expansion with respect to the Faber-Schauder class. Although we restrict ourselves to two specific spline methods, many of the results will carry over to other systems of multivariate splines and expansions with respect to tensor bases.
The proof of the approximation of MARS and Faber-Schauder functions by multilayer neural networks is constructive. This makes it possible to run a two-step procedure which in a first step estimates/learns a MARS or Faber-Schauder function. This estimate is then used to initialize the DNN. For more details see Section 4.1.
Related to our work, there are a few results in the literature showing that the class of functions generated by multilayer neural networks can be embedded into a larger space. [22] proves that neural network functions are also learnable by a carefully designed kernel method. Unfortunately, the analysis relies on a power series expansion of the activation function and does not hold for the ReLU. In [23], a convex relaxation of convolutional neural networks is proposed and a risk bound is derived.

This paper is organized as follows. Section 2 provides short introductions to the construction of MARS functions and the Faber-Schauder class. Notation and basic results for DNNs are summarized in Section 3. Section 4 contains the main results on the approximation of MARS and Faber-Schauder functions by DNNs. In Section 5, we derive risk comparison inequalities for nonparametric regression. Numerical simulations can be found in Section 6. Additional proofs are deferred to the end of the article.
Notation: Throughout the paper, we consider functions on the hypercube $[0,1]^d$ and $\|\cdot\|_\infty$ refers to the supremum norm on $[0,1]^d$. Vectors are displayed by bold letters, e.g. $\mathbf X, \mathbf x$.

Spline type methods for function estimation
In this section, we describe MARS and series expansions with respect to the Faber-Schauder system. For both methods we discuss the underlying function space and algorithms that do the function fitting. The construction of the function class is in both cases very similar. We first identify a smaller set of functions to which we refer as function system. The function class is then defined as the span of this function system.

MARS -Multivariate Adaptive Regression Splines
Consider functions of the form
$$h_{I,\mathbf t}(\mathbf x) = \prod_{j \in I} \big(s_j (x_j - t_j)\big)_+, \qquad (1)$$
with $I \subseteq \{1, \ldots, d\}$ an index set, $s_j \in \{-1, 1\}$ signs and $\mathbf t = (t_j)_{j \in I} \in [0,1]^{|I|}$ a shift vector. These are piecewise linear functions in each argument whose supports are hyperrectangles. Suppose we want to find a linear combination of the functions $h_{I,\mathbf t}$ that best explains the data. To this end, we would ideally minimize a given loss over all possible linear combinations of at most $M$ of these functions, where $M$ is a predetermined cut-off parameter. Differently speaking, we search for an element in the following function class.
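As a concrete illustration, a basis function $h_{I,\mathbf t}(\mathbf x) = \prod_{j\in I}(s_j(x_j-t_j))_+$ can be evaluated in a few lines. The following numpy sketch is ours (function names and the dict-based parametrization are not from the paper):

```python
import numpy as np

def mars_basis(x, I, s, t):
    """Evaluate h_{I,t}(x) = prod_{j in I} (s_j * (x_j - t_j))_+ .
    x: point in [0,1]^d;  I: active coordinates;
    s: dict j -> sign in {-1, +1};  t: dict j -> shift in [0, 1]."""
    val = 1.0
    for j in I:
        val *= max(s[j] * (x[j] - t[j]), 0.0)
    return val

# h(x) = (x_1 - 0.5)_+ * (-(x_2 - 0.25))_+   (coordinates 0-indexed in code)
print(mars_basis(np.array([0.75, 0.0]), [0, 1], {0: 1, 1: -1}, {0: 0.5, 1: 0.25}))
# -> 0.25 * 0.25 = 0.0625
```

The support of this example is the rectangle $[0.5, 1] \times [0, 0.25)$, matching the hyperrectangle supports noted above.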
All functions are defined on $[0,1]^d$. For notational convenience, the dependence on the constant $F$ is omitted.
For regression, the least squares loss is the natural choice. Least-squares minimization over the whole function space is, however, computationally intractable. Instead, MARS, proposed by [7], is a greedy algorithm that searches for a set of basis functions $h_{I,\mathbf t}$ that explain the data well. Starting with the constant function, in each step of the procedure two basis functions are added. The new basis functions are products of an existing basis function with a new factor $(s(x_j - t_j))_+$, $s \in \{-1, 1\}$, where $t_j$ coincides with the $j$-th component of one of the design vectors. Among all possible choices, the basis functions that lead to the best least-squares fit are selected. The algorithm is stopped after a prescribed number of iterations. It can be followed by a backward deletion procedure that iteratively deletes a fixed number of selected basis functions with small predictive power. Algorithm 1 provides additional details.
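A toy version of this greedy forward pass can be sketched as follows (this is our own minimal implementation for illustration, not the py-earth algorithm used later; it scores every candidate pair of mirrored hinge factors by the resulting least-squares fit):

```python
import numpy as np

def mars_forward(X, Y, n_steps):
    """Toy forward pass of MARS: greedily add pairs of mirrored hinge
    basis functions.  Each step multiplies an existing basis function
    b(x) by (x_j - t)_+ and by (t - x_j)_+, with t running over observed
    values X[:, j], and keeps the pair minimizing the residual sum of
    squares of the least-squares fit."""
    n, d = X.shape
    B = [np.ones(n)]                      # columns: basis functions at the data
    for _ in range(n_steps):
        best_rss, best_pair = np.inf, None
        for b in B:
            for j in range(d):
                for t in X[:, j]:
                    pair = [b * np.maximum(X[:, j] - t, 0.0),
                            b * np.maximum(t - X[:, j], 0.0)]
                    D = np.column_stack(B + pair)
                    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
                    rss = np.sum((Y - D @ coef) ** 2)
                    if rss < best_rss:
                        best_rss, best_pair = rss, pair
        B += best_pair
    coef, *_ = np.linalg.lstsq(np.column_stack(B), Y, rcond=None)
    return B, coef
```

If the true regression function is a single hinge with knot at an observed design point, one forward step recovers it exactly, since that knot is among the candidates.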
In the original article [7], the values for $t_j$ are restricted to a smaller set. The least squares functional can be replaced by any other loss. The runtime of the algorithm is of the order $M^3 nd$, where the key step is the computation of the argmin, which is easily parallelizable. [8] considers a variation of the MARS procedure where the runtime dependence on $M$ is only quadratic.
Algorithm 1 (MARS forward selection). Input: integer $M$ and sample $(X_i, Y_i)$, $i = 1, \ldots, n$. Output: $2M + 1$ basis functions of the form (1), selected by greedy comparison of the least-squares fits of the candidate models.

There are many higher-order generalizations of the MARS procedure. E.g., [7] also proposed to use truncated power basis functions of the form
$$\prod_{k=1}^{K} \big(s_k (x_{i(k)} - t_k)\big)_+, \qquad (2)$$
where $i(\cdot)$ is a mapping $\{1, \ldots, K\} \to \{1, \ldots, d\}$. The theoretical results of this paper can be generalized in a straightforward way to these types of higher-order basis functions. If the MARS algorithm is applied to the function class generated by (2), we refer to this as higher-order MARS (HO-MARS).

Faber-Schauder series expansion
The Faber-Schauder system defines a family of functions $[0,1]^d \to \mathbb R$ that can be viewed as the system of indefinite integrals of the Haar wavelet functions. Faber-Schauder functions are piecewise linear in each component, and expansion with respect to the Faber-Schauder system therefore results in (componentwise) piecewise linear approximations.
Proof. It is well-known that the Haar wavelet generates an orthonormal basis of $L^2[0,1]$. The result follows by integrating both sides of the Haar expansion, together with the fact that the Faber-Schauder functions are the indefinite integrals of the Haar wavelet functions.
In the univariate case, expansion with respect to the Faber-Schauder system is the same as piecewise linear interpolation on a dyadic grid. On the lowest resolution level, the function $f$ is approximated by $f(0) + (f(1) - f(0))x$, which interpolates the function at $x = 0$ and $x = 1$. If we go up to the $j$-th resolution level, the Faber-Schauder reconstruction linearly interpolates the function $f$ on the dyadic grid $x = k/2^{j+1}$, $k = 0, 1, \ldots, 2^{j+1}$. The Faber-Schauder system is consequently also a basis for the continuous functions equipped with the uniform norm. In neural networks terminology, this means that a universal approximation theorem holds.
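The equivalence between the truncated univariate Faber-Schauder expansion and linear interpolation on the dyadic grid can be checked numerically. The sketch below uses our own normalization (hat functions with peak one at the dyadic midpoints, coefficients given by second differences of $f$); under this convention the partial sum at level $J$ coincides with linear interpolation on the grid $k/2^{J+1}$:

```python
import numpy as np

def hat(x):
    """Triangular mother function with peak 1 at x = 1/2, support [0, 1]."""
    return np.maximum(1.0 - np.abs(2.0 * x - 1.0), 0.0)

def fs_partial_sum(f, J, x):
    """Univariate Faber-Schauder expansion of f truncated at level J.
    The coefficient at scale j and position k is the second difference
    f(mid) - (f(left) + f(right)) / 2 on the dyadic grid."""
    out = f(0.0) + (f(1.0) - f(0.0)) * x            # coarsest level: endpoints
    for j in range(J + 1):
        for k in range(2 ** j):
            left, right = k / 2 ** j, (k + 1) / 2 ** j
            coef = f((left + right) / 2) - 0.5 * (f(left) + f(right))
            out = out + coef * hat(2.0 ** j * x - k)
    return out

f = lambda t: np.sin(3.0 * t)
x = np.linspace(0.0, 1.0, 257)
grid = np.arange(17) / 16.0                          # dyadic grid k / 2^(J+1), J = 3
# the truncated expansion IS the piecewise linear interpolant on that grid
assert np.allclose(fs_partial_sum(f, 3, x), np.interp(x, grid, f(grid)))
```

In particular, the partial sum reproduces $f$ exactly at the grid points, which is the interpolation property stated above.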
Proof. By induction on $d$, the claim reduces to expanding the mixed partial derivatives. Each of the partial derivatives can be expanded with respect to the Haar wavelet system. Because the Haar wavelet functions are piecewise constant, the coefficients $d_\lambda$ can also be expressed as a linear combination of function values of $f$ on the dyadic grid. Similarly to the univariate case, the multivariate Faber-Schauder system interpolates the function on a dyadic grid. Since $Tf$ is moreover piecewise linear in each component, $Tf$ coincides with the linear interpolating spline for dyadic knot points. Another consequence is that the universal approximation theorem also holds for the multivariate Faber-Schauder class on $[0, 1]^d$.

Definition 2.4 (Faber-Schauder class).
For any non-negative integer $M$ and any constant $C > 0$, the $d$-variate Faber-Schauder class truncated at resolution level $M$ is defined as the set of linear combinations of tensor products $\Psi_{\lambda_1} \cdot \ldots \cdot \Psi_{\lambda_d}$ with indices $\lambda_i \in \Lambda_M$. We further assume that all coefficients are bounded in absolute value by $C$. The cardinality of the index set $\Lambda_M$ is $1 + 2^{M+1}$. The class $\mathcal F_{FS}(M, C)$ therefore consists of functions with $(1 + 2^{M+1})^d$ many parameters. Every function $\Psi_{\lambda_1} \cdot \ldots \cdot \Psi_{\lambda_d}$ can be represented as a linear combination of $3^d$ MARS functions. This yields the following result.
Lemma 2.5. Let $\mathcal I$ be defined as in (3). Then, for any $M, C > 0$, the Faber-Schauder class $\mathcal F_{FS}(M, C)$ is contained in a MARS function class. A consequence is that the MARS functions also satisfy a universal approximation theorem.
Reconstruction with respect to the Faber-Schauder class: Given a sample $(X_i, Y_i)$, $i = 1, \ldots, n$, we can fit the best least-squares Faber-Schauder approximation to the data. The coefficients can be equivalently obtained by least-squares reconstruction in the linear model with design matrix $X$ containing the Faber-Schauder functions evaluated at the design points. The matrix $X$ is sparse because Faber-Schauder functions on the same resolution level have disjoint supports. The least squares estimate is $\widehat\beta = (X^\top X)^{-1} X^\top Y$. If $X^\top X$ is singular or close to singular, we can employ Tikhonov regularization/ridge regression, leading to $\widehat\beta_\alpha = (X^\top X + \alpha I)^{-1} X^\top Y$ with $I$ the identity matrix and $\alpha > 0$. For large matrix dimensions, direct inversion is computationally infeasible. Instead, one can employ stochastic gradient descent, which has a particularly simple form and coincides for this problem with the randomized Kaczmarz method [15].
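To sketch the last remark: the randomized Kaczmarz iteration repeatedly projects the current iterate onto one equation of the linear system, which is SGD on the least-squares loss with a particular step size. A minimal implementation (ours, not taken from [15]):

```python
import numpy as np

def kaczmarz(X, Y, n_iter, rng):
    """Randomized Kaczmarz for X beta = Y: repeatedly pick a row i (with
    probability proportional to ||x_i||^2) and project the iterate onto
    the hyperplane x_i^T beta = y_i.  This is SGD on the least-squares
    loss with step size 1 / ||x_i||^2."""
    beta = np.zeros(X.shape[1])
    probs = np.sum(X ** 2, axis=1)
    probs = probs / probs.sum()
    for _ in range(n_iter):
        i = rng.choice(len(Y), p=probs)
        beta += (Y[i] - X[i] @ beta) / (X[i] @ X[i]) * X[i]
    return beta
```

For a consistent (noiseless) system, the iterates converge linearly to a solution, so only matrix-row accesses are needed and the normal equations never have to be formed or inverted.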

Deep neural networks (DNNs)
Throughout the article, we consider deep ReLU networks. This means that the activation function is chosen as $\sigma(x) = (x)_+ = \max(x, 0)$, $x \in \mathbb R$. In computer science, feedforward DNNs are typically introduced via their representation as directed graphs. For our purposes, it is more convenient to work with the following equivalent algebraic definition.
The number of hidden layers or depth of the multilayer neural network is denoted by $L$. The width of the network is described by a width vector $\mathbf p = (p_0, \ldots, p_{L+1}) \in \mathbb N^{L+2}$. A network is then any function of the form
$$f(\mathbf x) = W_{L+1}\, \sigma_{v_L} W_L\, \sigma_{v_{L-1}} \cdots W_2\, \sigma_{v_1} W_1 \mathbf x,$$
with weight matrices $W_\ell \in \mathbb R^{p_\ell \times p_{\ell-1}}$, shift vectors $v_\ell \in \mathbb R^{p_\ell}$, and $\sigma_v(\mathbf y) = (\max(y_i - v_i, 0))_i$ the ReLU activation applied componentwise after subtracting the shift. For convenience, we omit the composition symbol $\circ$. To obtain networks that are real-valued functions, we must have $p_{L+1} = 1$.
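A minimal numpy sketch of this algebraic definition (using the convention, assumed here, that the shift is subtracted inside the ReLU), together with a helper that counts nonzero parameters:

```python
import numpy as np

def relu_net(x, W, v):
    """f(x) = W[L] sigma_{v[L-1]} ... W[1] sigma_{v[0]} (W[0] x), where
    sigma_v(y) = max(y - v, 0) componentwise.  W holds L+1 weight
    matrices, v holds L shift vectors (no shift after the output layer)."""
    h = W[0] @ x
    for Wl, vl in zip(W[1:], v):
        h = Wl @ np.maximum(h - vl, 0.0)
    return h

def n_active(W, v):
    """Number of nonzero (active) network parameters."""
    return (sum(np.count_nonzero(M) for M in W)
            + sum(np.count_nonzero(u) for u in v))

# one hidden layer realizing the hinge function (x - 1/2)_+
W = [np.array([[1.0]]), np.array([[1.0]])]
v = [np.array([0.5])]
print(relu_net(np.array([0.75]), W, v))   # [0.25]
print(n_active(W, v))                     # 3
```

The example shows that a single ReLU unit already realizes one MARS hinge factor, which is the starting point of the constructive approximations later on.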
During network learning, regularization is induced by combining different methods such as weight decay, batch normalization and dropout. A theoretical description of these techniques seems, however, out of reach at the moment. To derive theory, sparsely connected networks are commonly assumed, which are believed to share some similarities with regularized network reconstructions. Under the sparsity paradigm, it is assumed that most of the network parameters are zero or non-active. The number of active/nonzero parameters is defined as
$$s := \sum_{\ell=1}^{L+1} \|W_\ell\|_0 + \sum_{\ell=1}^{L} \|v_\ell\|_0,$$
where $\|W_\ell\|_0$ and $\|v_\ell\|_0$ denote the number of nonzero entries of the weight matrix $W_\ell$ and the shift vector $v_\ell$. The function class $\mathcal F(L, \mathbf p, s)$ of all networks with the same network architecture $(L, \mathbf p)$ and network sparsity $s$ can then be defined accordingly. Network initialization is typically done by sampling small parameter weights. If one initializes the weights with large values, learning is known to perform badly ([10], Section 8.4). In practice, bounded parameters are therefore fed into the learning procedure as a prior assumption via the network initialization. We have incorporated this into the function class by assuming that all network parameters are bounded by a universal constant, which for convenience is taken to be one here.
Fitting a DNN to data is typically done via stochastic gradient descent on the empirical loss. The practical success of DNNs is partially due to a combination of several methods and tricks that are built into the algorithm. An excellent treatment is [10].
The number of required network parameters s needed to approximate a Faber-Schauder function with I parameters is thus also of order I up to a logarithmic factor.
The proofs of Lemma 4.1 and Lemma 4.2 are constructive. To find a good initialization of a neural network, we can therefore run MARS in a first step and then use the construction in the proof to obtain a DNN that generates the same function up to an error $\varepsilon$. This DNN is a good initial guess for the true input-output relation and can therefore be used as the initialization for any iterative method.
There exists a network architecture that allows approximation of the product of any two numbers in $[0,1]$ at an exponential rate with respect to the number of network parameters, cf. [21,17]. This construction is the key step in the approximation theory of DNNs with ReLU activation function and allows us to approximate the MARS and Faber-Schauder function classes. More precisely, we consider the following iterative extension to the product of $r$ numbers in $[0,1]$. Notice that a fully connected DNN with architecture $(L, \mathbf p)$ has $\sum_{\ell=0}^{L} (p_\ell + 1) p_{\ell+1} - p_{L+1}$ network parameters.
The total number of network parameters $s$ is bounded by $42 r^2 (1 + (N + 5) \lceil \log_2 r \rceil)$.
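The multiplication network can be emulated numerically. The classical construction, cf. [21,17], approximates $x^2$ on $[0,1]$ by $x - \sum_{s=1}^m g_s(x)/4^s$, where $g_s$ is the $s$-fold composition of the tent map $g(x) = 2(x)_+ - 4(x - 1/2)_+$ (each $g$ is one ReLU layer), and then obtains products by polarization, $xy = 2((x+y)/2)^2 - x^2/2 - y^2/2$. The sketch below is our implementation of this standard construction; the sup-error shrinks by a factor 16 for every two extra composition levels:

```python
import numpy as np

def g(x):
    """Tent map on [0,1]; as one ReLU layer: 2*(x)_+ - 4*(x - 1/2)_+ ."""
    return 2.0 * np.maximum(x, 0.0) - 4.0 * np.maximum(x - 0.5, 0.0)

def sq(x, m):
    """Approximate x^2 on [0,1] by x - sum_{s<=m} g^(s)(x) / 4^s; this is
    the piecewise linear interpolant of x^2 at step 2^-m, with sup-error
    2^(-2m-2)."""
    out, t = x, x
    for s in range(1, m + 1):
        t = g(t)
        out = out - t / 4.0 ** s
    return out

def mult(x, y, m):
    """Approximate x*y on [0,1]^2 via x*y = 2((x+y)/2)^2 - x^2/2 - y^2/2."""
    return 2.0 * sq((x + y) / 2.0, m) - 0.5 * sq(x, m) - 0.5 * sq(y, m)

xs = np.linspace(0.0, 1.0, 1001)
errs = [np.max(np.abs(sq(xs, m) - xs ** 2)) for m in (2, 4, 6, 8)]
# errs decreases by roughly a factor of 16 per two extra levels
```

Since each composition level costs only a constant number of parameters, this exhibits the exponential approximation rate in the number of parameters claimed above.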

Lower bounds
In this section, we show that MARS and the Faber-Schauder class require many more parameters to approximate the functions $f(x) = x^2$ and $f(x_1, x_2) = (x_1 + x_2 - 1)_+$. This shows that the converse of Lemma 4.1 and Lemma 4.2 is false and also gives some insight into when DNNs work better.

The case $f(x) = x^2$: This is an example of a univariate function where DNNs need far fewer parameters than MARS and the Faber-Schauder class.
(iii) There exists a neural network $h$ with $O(\log(1/\varepsilon))$ many layers and $O(\log(1/\varepsilon))$ many network parameters such that $\|h - f\|_\infty \le \varepsilon$. Part (iii) follows directly from Lemma 4.3. For (iv), one should notice that the number of hidden layers as defined in this work corresponds to $L - 2$ in [21]. For each of the methods, Theorem 4.4 allows us to find a lower bound on the number of parameters required to achieve approximation error at most $\varepsilon$. For MARS and the Faber-Schauder class, we need at least of the order $\varepsilon^{-1/2}$ many parameters, whereas there exist DNNs with $O(\log(1/\varepsilon))$ layers that require only $O(\log(1/\varepsilon))$ parameters.
Parts (i) and (ii) rely on the following classical result for approximation by piecewise linear functions, which for the sake of completeness is stated with a proof. Similar results have been used for DNNs by [21] and [12], Theorem 11.
Proof of Lemma 4.5. We prove this by contradiction, assuming that there exists a function $h^* \in \mathcal H(K)$ with $\|h^* - f\|_\infty < 1/(8K^2)$. Because $h^*$ consists of at most $K$ pieces, there exist two values $0 \le a < b \le 1$ with $b - a \ge 1/K$ such that $h^*$ is linear on the interval $[a, b]$. For $f(x) = x^2$,
$$f(a) + f(b) - 2f\Big(\frac{a+b}{2}\Big) = \frac{(b-a)^2}{2} \ge \frac{1}{2K^2},$$
while linearity of $h^*$ on $[a, b]$ gives $h^*(a) + h^*(b) - 2h^*((a+b)/2) = 0$. Using the triangle inequality,
$$\frac{1}{2K^2} \le \big|f(a) - h^*(a)\big| + \big|f(b) - h^*(b)\big| + 2\Big|f\Big(\frac{a+b}{2}\Big) - h^*\Big(\frac{a+b}{2}\Big)\Big| \le 4 \|h^* - f\|_\infty < \frac{1}{2K^2}.$$
This is a contradiction and the assertion follows.

Depending on the application, this can indeed be a big disadvantage. To describe an instance where this happens, consider a dataset with input being all pixels of an image. The machine learning perspective is that a canonical regression function depends on characteristic features of the underlying object, such as edges in the image. Edge filters are discrete directional derivatives and therefore linear combinations of the input. This can easily be realized by a neural network but requires far more parameters if approximated by MARS or the Faber-Schauder class.
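The $1/K^2$ rate in Lemma 4.5 is easy to check numerically: for the uniform-grid piecewise linear interpolant of $x^2$ with $K$ pieces, the sup-error is exactly $1/(4K^2)$, so it lies between the lower bound $1/(8K^2)$ and twice that value. A short verification (our own check):

```python
import numpy as np

# Sup-error of the K-piece uniform-grid linear interpolant of x^2 on [0,1].
# Any K-piece piecewise linear approximation has error >= 1/(8 K^2) by
# Lemma 4.5; the uniform interpolant attains 1/(4 K^2), so the 1/K^2
# rate is sharp.
xs = np.linspace(0.0, 1.0, 100001)
for K in (4, 16, 64):
    grid = np.linspace(0.0, 1.0, K + 1)
    err = np.max(np.abs(xs ** 2 - np.interp(xs, grid, grid ** 2)))
    assert 1.0 / (8 * K ** 2) <= err <= 1.0 / (4 * K ** 2) + 1e-12
```

In other words, halving the error requires roughly $\sqrt 2$ times as many linear pieces, in contrast to the logarithmic parameter count of the deep network construction.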
In particular, $f(\cdot, x_2)$ is piecewise linear on $[a, b]$ with one kink. (ii) Similar to (i) with $a = 0$ and $b = 2^{-M-1}$.

Risk comparison for nonparametric regression
In nonparametric regression with random design, the aim is to estimate the dependence of a response variable $Y \in \mathbb R$ on a random vector $\mathbf X \in [0,1]^d$. The dataset consists of $n$ independent copies $(Y_i, \mathbf X_i)_{i=1,\ldots,n}$ generated from the model
$$Y_i = f_0(\mathbf X_i) + \varepsilon_i.$$
We assume that $f_0 : [0,1]^d \to [0, F]$ for some $F > 0$ and that the noise variables $\varepsilon_i$ are independent, standard normal and independent of $(\mathbf X_i)_i$. An estimator or reconstruction method will typically be denoted by $\widehat f$. The performance of an estimator is measured by the prediction risk
$$R(\widehat f, f_0) := E\big[\big(\widehat f(\mathbf X) - f_0(\mathbf X)\big)^2\big],$$
with $\mathbf X$ having the same distribution as $\mathbf X_1$ but being independent of the sample $(Y_i, \mathbf X_i)_i$.
The cross-entropy in this model coincides with the least-squares loss, see [10], p. 129. A common property of all considered methods in this article is that they aim to find a function in the respective function class minimizing the least squares loss. Gradient methods for network reconstructions will typically get stuck in a local minimum or a flat saddle point. We define the following quantity that measures the expected difference between the empirical loss of a reconstruction method $\widehat f_n$ taking values in a class of networks $\mathcal F$ and that of the empirical risk minimizer over $\mathcal F$:
$$\Delta_n(\widehat f_n, f_0, \mathcal F) := E\Big[\frac 1n \sum_{i=1}^n \big(Y_i - \widehat f_n(\mathbf X_i)\big)^2 - \inf_{f \in \mathcal F} \frac 1n \sum_{i=1}^n \big(Y_i - f(\mathbf X_i)\big)^2\Big].$$
The quantity $\Delta_n(\widehat f_n, f_0, \mathcal F)$ allows us to separate out the statistical risk induced by the selected optimization method. Obviously, $\Delta_n(\widehat f_n, f_0, \mathcal F) \ge 0$, and $\Delta_n(\widehat f_n, f_0, \mathcal F) = 0$ if $\widehat f_n$ is the empirical risk minimizer over $\mathcal F$. For convenience, we write $\Delta_n := \Delta_n(\widehat f_n, f_0, \mathcal F(L, \mathbf p, s))$.

Finite sample results
This section provides a simulation-based comparison of DNNs, MARS and expansion with respect to the Faber-Schauder system. We also consider higher-order MARS (HO-MARS) using the function system (2). In particular, we study frameworks in which the higher flexibility of networks yields much better reconstructions compared to the other approaches. For MARS we rely on the py-earth package in Python. DNNs are fitted using the Adam optimizer in Keras (Tensorflow backend) with learning rate 0.001 and decay 0.00021.
As is well known, the network performance relies heavily on the initialization of the network parameters. In the settings below, several standard initialization schemes resulted in suboptimal performance of deep networks compared with the spline-based methods. One reason is that the initialization can 'deactivate' many units. This means that we generate $\mathbf w, v$ such that for every incoming signal $\mathbf x$ occurring in a given dataset, $(\mathbf w^\top \mathbf x + v)_+ = 0$. The gradients for $\mathbf w, v$ are then zero as well, and the unit remains deactivated for all times, reducing the expressiveness of the learned network. For one-dimensional inputs with values in $[0, 1]$, for instance, the $j$-th unit in the first layer is of the form $(w_j x + v_j)_+$. The unit then has its kink at $x = -v_j/w_j$. The unit is deactivated, in the sense above, if $-v_j/w_j > 1$ and $w_j > 0$, or if $-v_j/w_j < 0$ and $w_j < 0$.
To increase the expressiveness of the network, the most desirable case is that the kink falls into the interval $[0, 1]$. In view of the inequalities above, this requires that $v_j$ and $w_j$ do not have the same sign and that $|w_j| \ge |v_j|$. A similar argument also holds for the deeper layers.
The Glorot/Xavier uniform initializer suggested in [9] initializes the network by setting all shifts/biases to zero and sampling all weights in layer $\ell$ independently from a uniform distribution $U[-b_\ell, b_\ell]$ with $b_\ell = \sqrt{6/(m_{\ell-1} + m_\ell)}$ and $m_\ell$ the number of nodes in the $\ell$-th layer. This causes the kink to be at zero for all units. To distribute the kink locations more uniformly, we also sample the entries of the $\ell$-th shift vector independently from a uniform distribution. It is also possible to incorporate shape information into the network initialization. If the weights are generated from a uniform distribution $U[0, b_\ell]$, all slopes are non-negative. This can be viewed as incorporating the prior information that the true function is increasing. In view of the argument above, it is then natural to sample the biases from $U[-b_\ell, 0]$. We call this the increasing Glorot initializer.
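A sketch of the initializers discussed above (the function and the exact bias ranges for the two variants are our reading of the text; the weight limit $b = \sqrt{6/(m_{\mathrm{in}} + m_{\mathrm{out}})}$ is the standard Glorot uniform limit, as also used by Keras' GlorotUniform):

```python
import numpy as np

def init_layer(m_in, m_out, rng, scheme="glorot"):
    """Weight/bias initialization for one layer with m_in inputs, m_out units.
    scheme = "glorot":       weights U[-b, b], biases 0 (as in [9]);
    scheme = "uniform_bias": biases also uniform, spreading kink locations;
    scheme = "increasing":   weights U[0, b], biases U[-b, 0], so every unit
                             is nondecreasing with kink at -v/w >= 0."""
    b = np.sqrt(6.0 / (m_in + m_out))
    if scheme == "increasing":
        W = rng.uniform(0.0, b, size=(m_out, m_in))
        v = rng.uniform(-b, 0.0, size=m_out)
    else:
        W = rng.uniform(-b, b, size=(m_out, m_in))
        v = np.zeros(m_out) if scheme == "glorot" else rng.uniform(-b, b, size=m_out)
    return W, v
```

Under the "increasing" scheme, every first-layer unit $(w_j x + v_j)_+$ has $w_j \ge 0$ and $v_j \le 0$, so its kink $-v_j/w_j$ is non-negative and the unit is active on part of $[0,1]$ whenever $|w_j| \ge |v_j|$.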
As described above, it might still happen that the initialization scheme deactivates many units, which might lead to extremely bad reconstructions. For a given dataset, we can identify these cases because of their large empirical loss. As often done in practice, for each dataset we re-initialize the neural network several times and pick the reconstruction with the smallest empirical loss.
The case $f_0(x) = x^2$: In the previous section, we have seen that MARS and the Faber-Schauder class have worse approximation properties for the function $f_0(x) = x^2$ when compared with DNNs. It is therefore of interest to study this as a test case with independent observations simulated from model (8). Since $f_0(x) = x^2$ is smooth, a good approximation should already be achieved with a relatively small number of parameters in any of the considered function classes. Thus, we set $M = 10$ for MARS and $K = 3$, $M = 10$ for higher-order MARS, allowing for polynomials of degree up to three. For the Faber-Schauder class $\mathcal F_{FS}(M, C)$ we set $M = 3$, meaning that details up to size $2^{-3}$ can be resolved. For the neural network reconstruction, we consider dense networks with three hidden layers and five units in each hidden layer. The network parameters are initialized using the increasing Glorot initializer described above. The simulation results are summarized in Table 1, see also Figure 2. Concerning the prediction risk, all methods perform equally well, meaning that the effect of the additive noise in model (8) dominates the approximation properties of the function classes. We find that higher-order MARS often yields a piecewise linear approximation of the regression function and does not include any quadratic term.
The theory suggests that for the squared regression function, DNNs should also perform better than the Faber-Schauder expansion. We believe that DNNs do not find a good local minimum in most cases, explaining the discrepancy between theory and practice. Further improvements of DNNs would require a more refined network initialization scheme.
The case $f_0(x) = \sin(5 \cdot 2\pi x)$: To compare the different methods for rough functions, we simulate independent observations from the corresponding regression model. For such highly oscillating functions, all methods require many parameters. We therefore study MARS with $M = 50$, higher-order MARS with $K = 5$ and $M = 50$, the Faber-Schauder class with $M = 6$, and fully connected networks with ten hidden layers, network width vector $\mathbf p = (1, 10, \ldots, 10, 1)$ and the Glorot uniform initializer. Results are displayed in Table 2 and Figure 3. Intuitively, fitting functions generated by a local function system seems favorable if the truth is highly oscillating, and the Faber-Schauder series expansion indeed turns out to perform best in this example.
The case $f_0(x_1, x_2) = (x_1 + x_2 - 1)_+$: As shown in Lemma 4.6, DNNs need far fewer parameters than MARS or the Faber-Schauder class for this function. We generate independent observations from model (12). Notice that the regression function $f_0$ is piecewise linear with a change of slope along the line $x_1 + x_2 = 1$. For convenience, we only consider the more flexible higher-order MARS. The prediction risks reported in Table 3 and the reconstructions displayed in Figure 4 show that in this problem neural networks outperform higher-order MARS and reconstruction with respect to the Faber-Schauder class, even if we allow for more parameters.
Table 3: Average prediction risks in model (12) based on 100 repetitions, standard deviations in brackets. HO-MARS: $2.20 \cdot 10^{-4}$ ($7 \cdot 10^{-5}$); FS: $9.86 \cdot 10^{-5}$ ($9 \cdot 10^{-6}$); DNN: $2.46 \cdot 10^{-7}$ ($1 \cdot 10^{-6}$).

Dependence on input dimension: In this simulation, the dependence of the risk on the input dimension is studied. We generate independent observations from a regression model whose regression function is constructed in such a way that it can be well approximated by both higher-order MARS and neural networks. Independently of the dimension $d$, the setup for higher-order MARS is $K = 10$ and $M = 30$. We choose a densely connected neural network with architecture $L = 15$, $\mathbf p = (d, 10, \ldots, 10, 1)$ and the Glorot uniform initializer. The prediction risks for $d = 1, 2, 5, 10$ are summarized in Table 4. The results show that both methods suffer from the curse of dimensionality. Neural networks achieve a consistently smaller risk for all dimensions.
The risks are summarized in Table 5. The obtained values are comparable with the results in the last row of Table 4 showing that both neural network and higher order MARS reconstructions are adaptive with respect to the active parameters of the model.
such that $\|H_{m,N} - h_m\|_\infty \le 3^d 2^{-N}$. The first hidden layer of the network $H_{m,N}$ builds the factors $(s_j(x_j - t_j))_+$ for the input $\mathbf x \in \mathbb R^d$. The remaining $d - |I|$ units of the first layer are set to one. This can be achieved by choosing the first weight matrix to be $W_1 = (s_j \mathbf 1(i = j, i \in I))_{i,j \in \{1,\ldots,d\}}$ and the corresponding shift vector as $v_1 = (s_j t_j \mathbf 1(j \in I) - \mathbf 1(j \notin I))_j$, where $\mathbf 1(\cdot)$ denotes the indicator function.
Because $(s_j(x_j - t_j))_+ \in [0, 1]$ for all $j \in I$, the product of these factors can be approximated by the multiplication network constructed above. Finally, we describe the construction of a network $h$ that approximates the function $f = \beta_0 + \sum_{m=1}^M \beta_m h_m$ with all coefficients bounded in absolute value by $C$. For that, we first include a layer that maps $(h_1, \ldots, h_M)$ to $(1, h_1, \ldots, h_M)$ using $M + 1$ active parameters. Every component is then multiplied with $C$ by parallelized sub-networks with width $2$, $2\lceil \log_2 C \rceil - 1$ hidden layers and $4\lceil \log_2 C \rceil$ parameters each. It is straightforward to construct such a subnetwork.
Proofs of Theorem 5.1 and Theorem 5.2: We derive an upper bound for the risk of the empirical risk minimizer using the following oracle inequality, which was proven in [17]. Let $\mathcal N(\delta, \mathcal F, \|\cdot\|_\infty)$ be the covering number, that is, the minimal number of $\|\cdot\|_\infty$-balls with radius $\delta$ needed to cover $\mathcal F$ (the centers do not need to be in $\mathcal F$). Then, for any $\gamma > 0$ and $\delta \in (0, 1]$,
$$R(\widehat f_n, f_0) \le (1 + \gamma)\Big(\inf_{f \in \mathcal F} E\big[\big(f(\mathbf X) - f_0(\mathbf X)\big)^2\big] + \Delta_n\Big) + F^2\, \frac{14 \log \mathcal N(\delta, \mathcal F, \|\cdot\|_\infty) + 20}{n \gamma} + 25 \delta F.$$