Neural Networks, Volume 161, April 2023, Pages 242–253

Simultaneous approximation of a smooth function and its derivatives by deep neural networks with piecewise-polynomial activations

https://doi.org/10.1016/j.neunet.2023.01.035

Highlights

  • Rates and complexity for smooth function approximation in Hölder norms by ReQU neural networks.

  • Explicit and uniform bounds for weights of the approximating neural network.

  • Exponential convergence rates for analytic functions.

Abstract

This paper investigates the approximation properties of deep neural networks with piecewise-polynomial activation functions. We derive the required depth, width, and sparsity of a deep neural network to approximate any Hölder smooth function up to a given approximation error in Hölder norms in such a way that all weights of this neural network are bounded by 1. The latter feature is essential to control generalization errors in many statistical and machine learning applications.

Introduction

Neural networks have recently gained much attention due to their impressive performance in many complicated practical tasks, including image processing (LeCun, Bengio, & Hinton, 2015), generative modeling (Goodfellow et al., 2014), reinforcement learning (Mnih et al., 2015), the numerical solution of PDEs (e.g., Geist et al., 2021, Han et al., 2018), and optimal control (Chen et al., 2019, Onken et al., 2022). This makes them extremely useful in the design of self-driving vehicles (Li, Ota, & Dong, 2018) and robot control systems (e.g., Bozek et al., 2020, Cembrano et al., 1994, González-Álvarez et al., 2022). One of the reasons for this success of neural networks is their expressiveness, that is, the ability to approximate functions with any desired accuracy. The question of the expressiveness of neural networks has a long history and goes back to the papers (Cybenko, 1989, Funahashi, 1989, Hornik, 1991). In particular, Cybenko (1989) showed that one hidden layer is enough to approximate any continuous function $f$ with any prescribed accuracy $\varepsilon > 0$. However, further analysis revealed that deep neural networks may require far fewer parameters than shallow ones to approximate $f$ with the same accuracy. Much effort has been devoted in recent years to understanding the fidelity of deep neural networks. In a pioneering work, Yarotsky (2017) showed that for any target function $f$ from the Sobolev space $W^{n,\infty}([0,1]^d)$ there is a neural network with $O(\varepsilon^{-d/n})$ parameters and ReLU activation function that approximates $f$ within accuracy $\varepsilon$ with respect to the $L_\infty$-norm on the unit cube $[0,1]^d$. Further works in this direction considered various smoothness classes of the target functions (Gühring and Raslan, 2021, Li et al., 2020a, Lu et al., 2021, Shen et al., 2021), neural networks with diverse activations (De Ryck et al., 2021, Gühring and Raslan, 2021, Jiao et al., 2021, Langer, 2021), domains of more complicated shape (Shen, Yang, & Zhang, 2020), and measured the approximation errors with respect to different norms (De Ryck et al., 2021, Gühring and Raslan, 2021, Schmidt-Hieber, 2020, Yarotsky, 2017). Several authors also considered the expressiveness of neural networks with different architectures. This includes wide neural networks of logarithmic (Gühring and Raslan, 2021, Schmidt-Hieber, 2020, Yarotsky, 2017) or even constant depth (De Ryck et al., 2021, Li et al., 2020a, Li et al., 2020b, Shen et al., 2021), as well as deep and narrow networks (Hanin, 2019, Kidger and Lyons, 2020, Park et al., 2021). Most of the existing results on the expressiveness of neural networks measure the quality of approximation with respect to either the $L_\infty$- or $L_p$-norm, $p \ge 1$. Much fewer papers study the approximation of derivatives of smooth functions; these rare exceptions include Gühring et al., 2020, Gühring and Raslan, 2021 and De Ryck et al. (2021).

In the present paper, we focus on feed-forward neural networks with piecewise-polynomial activation functions of the form $\sigma_{\mathrm{ReQU}}(x) = (\max\{x, 0\})^2$. Neural networks with such activations are known to successfully approximate smooth functions from the Sobolev and Besov spaces with respect to the $L_\infty$- and $L_p$-norms (see, for instance, Abdeljawad and Grohs, 2022, Ali and Nouy, 2021, Chen et al., 2022, Gribonval et al., 2022, Klusowski and Barron, 2018, Li et al., 2020a, Li et al., 2020b, Siegel and Xu, 2022). We continue this line of research and study the ability of such neural networks to approximate not only smooth functions themselves but also their derivatives. We derive non-asymptotic upper bounds on the Hölder norm of the difference between the target function and its approximation from a class of sparsely connected neural networks with ReQU activations. In particular, we show that for any $f$ from a Hölder ball $H^\beta([0,1]^d, H)$, $H > 0$, $\beta > 2$ (see Section 2 for the definition), and any $\varepsilon > 0$, there exists a neural network with ReQU activation functions that uniformly approximates the target function in the norms of the Hölder spaces $H^\ell([0,1]^d)$ for all $\ell \in \{0, \ldots, \lfloor\beta\rfloor\}$. Here and further in the paper, $\lfloor\beta\rfloor$ stands for the largest integer which is strictly smaller than $\beta$. A simplified statement of our main result is given below.
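Before stating it, for concreteness, here is a minimal sketch of a fully connected network with ReQU activations, assuming PyTorch; the names ReQU and requ_mlp are illustrative, and the architecture shown is a generic one rather than the specific construction analyzed in the paper.

    import torch
    import torch.nn as nn

    class ReQU(nn.Module):
        """Rectified quadratic unit: sigma(x) = (max(x, 0))**2."""
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.clamp(x, min=0.0) ** 2

    def requ_mlp(in_dim: int, out_dim: int, width: int, depth: int) -> nn.Sequential:
        """Fully connected network with `depth` hidden layers of width `width`,
        each followed by a ReQU activation (illustrative architecture)."""
        layers = [nn.Linear(in_dim, width), ReQU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(width, width), ReQU()]
        layers.append(nn.Linear(width, out_dim))
        return nn.Sequential(*layers)

    # Example: a ReQU network mapping [0,1]^d -> R^p with d = 2, p = 1.
    net = requ_mlp(in_dim=2, out_dim=1, width=16, depth=3)
    x = torch.rand(8, 2)       # a batch of points in the unit square
    print(net(x).shape)        # torch.Size([8, 1])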

Theorem 1 (simplified version of Theorem 2)

Fix $\beta > 2$ and $p, d \in \mathbb{N}$. Then, for any $H > 0$, any $f: [0,1]^d \to \mathbb{R}^p$ with $f \in H^\beta([0,1]^d, H)$, and any integer $K \ge 2$, there exists a neural network $h_f: [0,1]^d \to \mathbb{R}^p$ with ReQU activation functions such that it has $O(\log d + \beta + \log\log H)$ layers, at most $O\big(p\,d\,(K + \lfloor\beta\rfloor)^d\big)$ neurons in each layer, and $O\big(p\,(d\lfloor\beta\rfloor + d^2 + \log\log H)(K + \lfloor\beta\rfloor)^d\big)$ non-zero weights taking their values in $[-1, 1]$. Moreover, it holds that
$$\|f - h_f\|_{H^\ell([0,1]^d)} \le C^{\beta d}\, H\, \beta^{\ell}\, K^{\ell - \beta} \quad \text{for all } \ell \in \{0, \ldots, \lfloor\beta\rfloor\},$$
where $C$ is a universal constant.
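The rates in Theorem 1 combine into the error-versus-size trade-off highlighted in the contributions below: with roughly $N \asymp (K + \lfloor\beta\rfloor)^d$ non-zero weights (constants ignored), an $H^\ell$-error of order $K^{\ell - \beta}$ behaves like $N^{-(\beta - \ell)/d}$. The following back-of-the-envelope check of this exponent uses illustrative values of $\beta$, $d$, $\ell$ and drops all constants; it is not part of the paper's analysis.

    import math

    beta, d = 3.5, 2
    fb = 3        # floor(beta): the largest integer strictly smaller than beta
    l = 1         # order of the Hölder norm in which the error is measured

    prev = None
    for K in (8, 32, 128, 512):
        n_weights = (K + fb) ** d     # number of non-zero weights, up to constants
        error = K ** (l - beta)       # H^l approximation error, up to constants
        if prev is not None:
            slope = math.log(error / prev[1]) / math.log(n_weights / prev[0])
            print(f"K = {K:4d}: empirical exponent ~ {slope:+.3f}")
        prev = (n_weights, error)

    print("predicted exponent -(beta - l)/d =", -(beta - l) / d)   # -1.25 here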

We provide explicit expressions for the hidden constants in Theorem 2. The main contributions of our work can be summarized as follows.

  • Given a smooth target function $f \in H^\beta([0,1]^d, H)$, we construct a neural network that simultaneously approximates all the derivatives of $f$ up to order $\lfloor\beta\rfloor$ with optimal dependence of the precision on the number of non-zero weights. That is, if we denote the number of non-zero weights in the network by $N$, then it holds that $\|f - h_f\|_{H^\ell([0,1]^d)} = O\big(N^{-(\beta - \ell)/d}\big)$ simultaneously for all $\ell \in \{0, \ldots, \lfloor\beta\rfloor\}$.

  • The constructed neural network has almost the same smoothness as the target function itself while approximating it with the optimal accuracy. This property turns out to be very useful in many applications including the approximation of PDEs and density transformations where we need to use derivatives of the approximation.

  • The weights of the approximating neural network are bounded in absolute value by 1. The latter property plays a crucial role in deriving bounds on the generalization error of empirical risk minimizers in terms of the covering number of the corresponding parametric class of neural networks. Note that the upper bounds on the weights provided in De Ryck et al., 2021, Gühring et al., 2020, Gühring and Raslan, 2021 blow up as the approximation error decreases.

The rest of the paper is organized as follows. In Section 2, we introduce the necessary definitions and notation. Section 3 contains the statement of our main result, Theorem 2, followed by a detailed comparison with the existing literature. We then present numerical experiments in Section 4. The proofs are collected in Section 5. Some auxiliary facts are deferred to the Appendix.

Section snippets

Norms.

For a matrix $A$ and a vector $v$, we denote by $\|A\|_\infty$ and $\|v\|_\infty$ the maximal absolute value of the entries of $A$ and $v$, respectively. $\|A\|_0$ and $\|v\|_0$ shall stand for the number of non-zero entries of $A$ and $v$, respectively. Finally, the Frobenius norm and the operator norm of $A$ are denoted by $\|A\|_F$ and $\|A\|$, respectively, and the Euclidean norm of $v$ is denoted by $\|v\|$. For a function $f: \Omega \to \mathbb{R}^d$, we set
$$\|f\|_{L_\infty(\Omega)} = \operatorname*{ess\,sup}_{x \in \Omega} \|f(x)\|, \qquad \|f\|_{L_p(\Omega)} = \left(\int_\Omega \|f(x)\|^p \, dx\right)^{1/p}, \quad p \ge 1.$$
If the domain $\Omega$ is clear from the context, we simply write $L_\infty$ or $L_p$.
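A small numpy illustration of these norms; the function norms are approximated on a finite grid, which is a simplification made here for illustration only.

    import numpy as np

    A = np.array([[1.0, -3.0, 0.0],
                  [0.5,  0.0, 2.0]])
    v = np.array([0.0, -4.0, 1.5])

    max_abs_A = np.max(np.abs(A))           # maximal absolute value of the entries of A
    nnz_A = np.count_nonzero(A)             # number of non-zero entries of A
    fro_A = np.linalg.norm(A, ord="fro")    # Frobenius norm of A
    op_A = np.linalg.norm(A, ord=2)         # operator (spectral) norm of A
    eucl_v = np.linalg.norm(v)              # Euclidean norm of v

    # L_infty and L_p norms of a scalar test function g on Omega = [0,1]^2,
    # approximated on a uniform grid (the unit square has volume 1, so the
    # mean over the grid approximates the integral).
    p = 2
    xs = np.linspace(0.0, 1.0, 201)
    X1, X2 = np.meshgrid(xs, xs, indexing="ij")
    G = np.sin(X1) * np.cos(X2)                         # an arbitrary test function
    sup_norm = np.max(np.abs(G))                        # ~ ||g||_{L_infty([0,1]^2)}
    lp_norm = np.mean(np.abs(G) ** p) ** (1.0 / p)      # ~ ||g||_{L_p([0,1]^2)}
    print(max_abs_A, nnz_A, fro_A, op_A, eucl_v, sup_norm, lp_norm)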

Approximation of functions from Hölder classes

Our main result states that any function from $H^\beta([0,1]^d, H)$, $H > 0$, $\beta > 2$, can be approximated by a feed-forward deep neural network with ReQU activation functions in $H^\ell([0,1]^d)$, $\ell \in \{0, \ldots, \lfloor\beta\rfloor\}$.

Theorem 2

Let $\beta > 2$ and let $p, d \in \mathbb{N}$. Then, for any $H > 0$, any $f: [0,1]^d \to \mathbb{R}^p$ with $f \in H^\beta([0,1]^d, H)$, and any integer $K \ge 2$, there exists a neural network $h_f: [0,1]^d \to \mathbb{R}^p$ of width
$$\Big(4d(K + \lfloor\beta\rfloor)^d \vee \big(12(K + 2\lfloor\beta\rfloor) + 1\big)\Big)\, p$$
with
$$6 + 2(\lfloor\beta\rfloor - 2) + \lceil\log_2 d\rceil + 2\lceil\log_2(2d\lfloor\beta\rfloor + d)\rceil \vee \lceil\log_2\log_2 H\rceil \vee 1$$
hidden layers and at most $p\,(K + \lfloor\beta\rfloor)^d\, C(\beta, d, H)$ non-zero weights taking their values in $[-1, 1]$.

Numerical experiments

In this section, we provide numerical experiments to illustrate the approximation properties of neural networks with ReQU activations. We considered a scalar function $f(x) = \sin(x_1^2 x_2)$ of two variables and approximated it on the unit square $[0,1]^2$ via neural networks with two types of activations: ReLU and ReQU. All the neural networks were fully connected, and all their hidden layers had width 16. The first layer had width 2. The depth of the neural networks took its values in $\{1, 2, 3, 4, 5\}$.
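A compact sketch of such an experiment is given below, assuming PyTorch. The training protocol (sample sizes, optimizer, number of steps) is an illustrative choice rather than the exact protocol used in the paper, and the target is read as f(x) = sin(x_1^2 x_2) from the formula above.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    class ReQU(nn.Module):
        def forward(self, x):               # sigma(x) = (max(x, 0))**2
            return torch.relu(x) ** 2

    def make_net(depth: int, act: nn.Module) -> nn.Sequential:
        # Fully connected: input width 2, hidden width 16, scalar output.
        layers = [nn.Linear(2, 16), act]
        for _ in range(depth - 1):
            layers += [nn.Linear(16, 16), act]
        layers.append(nn.Linear(16, 1))
        return nn.Sequential(*layers)

    def f(x: torch.Tensor) -> torch.Tensor:
        return torch.sin(x[:, :1] ** 2 * x[:, 1:])

    x_train, x_test = torch.rand(4096, 2), torch.rand(2048, 2)
    y_train = f(x_train)

    for depth in (1, 2, 3, 4, 5):
        for name, act in (("ReLU", nn.ReLU()), ("ReQU", ReQU())):
            net = make_net(depth, act)
            opt = torch.optim.Adam(net.parameters(), lr=1e-3)
            for _ in range(1000):           # illustrative training budget
                opt.zero_grad()
                loss = ((net(x_train) - y_train) ** 2).mean()
                loss.backward()
                opt.step()
            err = (net(x_test) - f(x_test)).abs().max().item()
            print(f"depth={depth} {name}: sup-error ~ {err:.3e}")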

Proof of Theorem 2

Step 1. Let $f = (f_1, \ldots, f_p)$. Consider a vector $a = (a_1, \ldots, a_{2\lfloor\beta\rfloor + K + 1})$ such that
$$a_1 = \cdots = a_{\lfloor\beta\rfloor + 1} = 0; \quad a_{\lfloor\beta\rfloor + 1 + j} = j/K, \ 1 \le j \le K - 1; \quad a_{\lfloor\beta\rfloor + K + 1} = \cdots = a_{2\lfloor\beta\rfloor + K + 1} = 1.$$
By Theorem 3, there exist tensor-product splines $S_f^{\beta,K} = (S_{f,1}^{\beta,K}, \ldots, S_{f,p}^{\beta,K})$ of order $\lfloor\beta\rfloor \ge 2$ associated with knots at $\{(a_{j_1}, a_{j_2}, \ldots, a_{j_d}) : j_1, \ldots, j_d \in \{1, \ldots, 2\lfloor\beta\rfloor + K + 1\}\}$ such that
$$\|f - S_f^{\beta,K}\|_{H^\ell([0,1]^d)} \le \max_{m \in \{1, \ldots, p\}} \|f_m - S_{f,m}^{\beta,K}\|_{H^\ell([0,1]^d)} \le (2ed)^{\beta} H K^{\ell - \beta} + 9^d (\lfloor\beta\rfloor - 1)(2\lfloor\beta\rfloor + 1)^{2d + \ell} (2ed)^{\beta} H K^{\ell - \beta}.$$
Our goal is to show that $S_f^{\beta,K}$ can be represented by a neural network $h_f$ with ReQU activations.
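To make Step 1 concrete, the snippet below builds the knot vector a and a univariate B-spline on it, using numpy and scipy purely for illustration. The degree convention (degree ⌊β⌋, hence ⌊β⌋+1 repeated boundary knots) is an assumption consistent with the knot multiplicities above, and the coefficients are placeholders rather than the spline S_f^{β,K} from Theorem 3.

    import numpy as np
    from scipy.interpolate import BSpline

    beta_floor, K = 3, 8      # floor(beta) and the number of grid cells, chosen for illustration

    # Knot vector a_1, ..., a_{2*floor(beta) + K + 1}: floor(beta)+1 zeros,
    # interior knots j/K for j = 1, ..., K-1, and floor(beta)+1 ones.
    a = np.concatenate([
        np.zeros(beta_floor + 1),
        np.arange(1, K) / K,
        np.ones(beta_floor + 1),
    ])
    assert a.size == 2 * beta_floor + K + 1

    # One univariate B-spline of degree floor(beta) on these knots; tensor products
    # of such univariate splines over the d coordinates yield the tensor-product
    # splines used in the proof.
    n_basis = a.size - beta_floor - 1     # number of B-spline basis functions
    coef = np.zeros(n_basis)
    coef[n_basis // 2] = 1.0              # select a single basis function as an example
    spline = BSpline(a, coef, beta_floor)
    print(spline(np.linspace(0.0, 1.0, 5)))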

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The publication was supported by the grant for research centers in the field of AI provided by the Analytical Center for the Government of the Russian Federation (ACRF) in accordance with the agreement on the provision of subsidies (identifier of the agreement 000000D730321P5Q0002) and the agreement with HSE University, Russia No. 70-2021-00139. Nikita Puchkin is a Young Russian Mathematics award winner and would like to thank its sponsors and jury. Denis Belomestny acknowledges the financial

References (41)

  • Abdeljawad, A., et al. (2022). Approximations with deep neural networks in Sobolev time-space. Analysis and Applications.

  • Ali, M., et al. (2021). Approximation of smoothness classes by deep rectifier networks. SIAM Journal on Numerical Analysis.

  • Bozek, P., et al. (2020). Neural network control of a wheeled mobile robot based on optimal trajectories. International Journal of Advanced Robotic Systems.

  • Candès, E., et al. (2007). Fast computation of Fourier integral operators. SIAM Journal on Scientific Computing.

  • Candès, E., et al. (2009). A fast butterfly algorithm for the computation of Fourier integral operators. Multiscale Modeling & Simulation: A SIAM Interdisciplinary Journal.

  • Chen, Y., et al. (2019). Optimal control via neural networks: A convex approach.

  • Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems.

  • Geist, M., et al. (2021). Numerical solution of the parametric diffusion equation by deep neural networks. Journal of Scientific Computing.

  • González-Álvarez, M., Dupeyroux, J., Corradi, F., & de Croon, G. C. (2022). Evolved neuromorphic radar-based altitude...

  • Goodfellow, I., et al. (2014). Generative adversarial nets.