
Neural Networks

Volume 142, October 2021, Pages 619-635

Deep ReLU neural networks in high-dimensional approximation

https://doi.org/10.1016/j.neunet.2021.07.027

Abstract

We study the computation complexity of deep ReLU (Rectified Linear Unit) neural networks for the approximation of functions from the Hölder–Zygmund space of mixed smoothness defined on the d-dimensional unit cube when the dimension d may be very large. The approximation error is measured in the norm of an isotropic Sobolev space. For every function f from the Hölder–Zygmund space of mixed smoothness, we explicitly construct a deep ReLU neural network having an output that approximates f with a prescribed accuracy ɛ, and prove tight dimension-dependent upper and lower bounds of the computation complexity of the approximation, characterized as the size and depth of this deep ReLU neural network, explicitly in d and ɛ. The proof of these results relies, in particular, on the approximation by sparse-grid sampling recovery based on the Faber series.

Introduction

Neural networks have been studied and used for almost 80 years, dating back to the foundational works of Hebb (1949), McCulloch and Pitts (1943) and Rosenblatt (1958). In recent years, deep neural networks have been successfully applied to a striking variety of machine learning problems, including computer vision (Krizhevsky, Sutskever, & Hinton, 2012), natural language processing (Wu, Schuster, Chen, Le, & Norouzi, 2016), and speech recognition and image classification (LeCun, Bengio, & Hinton, 2015). The main advantage of deep neural networks over shallow ones is that they can output compositions of functions cheaply. As their range of applications widens, theoretical analysis of why deep neural networks can lead to significant practical improvements has attracted substantial attention (Arora et al., 2017, Daubechies et al., 2019, Montúfar et al., 2014, Telgarsky, 2015, Telgarsky, 2016). In the last several years, there have been a number of interesting papers that address the role of depth and architecture of deep neural networks in approximating sets of functions with very special regularity properties, such as analytic functions (E and Wang, 2018, Mhaskar, 1996), differentiable functions (Petersen and Voigtlaender, 2018, Yarotsky, 2017a), oscillatory functions (Grohs, Perekrestenko, Elbrachter, & Bolcskei, 2019), functions in isotropic Sobolev or Besov spaces (Ali and Nouy, 2020, Daubechies et al., 2019, Gribonval et al., 2021, Gühring et al., 2020, Yarotsky, 2017b) and functions in spaces of mixed smoothness (Montanelli and Du, 2019, Suzuki, 2019).

It has been shown that there is a close relation between the approximation by sampling recovery based on B-spline interpolation and quasi-interpolation representation, and the approximation by deep neural networks (Ali and Nouy, 2020, Daubechies et al., 2019, Montanelli and Du, 2019, Schwab and Zech, 2019, Suzuki, 2019, Yarotsky, 2017a, Yarotsky, 2017b). Most of these papers used deep ReLU (Rectified Linear Unit) neural networks for approximation, since the rectified linear unit is a simple and preferable activation function in many applications. The output of such a network is a continuous piecewise-linear function which is easily and cheaply computed.
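To make the piecewise-linearity claim concrete, here is a minimal sketch (our own toy illustration, not a network constructed in this paper) of how a feed-forward ReLU network is evaluated: each hidden layer is an affine map followed by the coordinate-wise maximum with zero, so the output is a continuous piecewise-linear function of the input that is cheap to compute. The architecture widths below are arbitrary placeholders.

```python
# Minimal illustration (not the paper's construction): the output of a
# feed-forward ReLU network is a continuous piecewise-linear function,
# computed layer by layer as affine maps followed by ReLU.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_net(x, weights, biases):
    """Evaluate a feed-forward ReLU network on inputs x (shape: n_samples x d).
    weights/biases: lists of layer parameters; ReLU is applied on hidden layers only."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return h @ weights[-1] + biases[-1]     # affine output layer

rng = np.random.default_rng(0)
d, widths = 2, [2, 8, 8, 1]                 # toy architecture: small depth and widths
weights = [rng.standard_normal((m, n)) for m, n in zip(widths[:-1], widths[1:])]
biases = [rng.standard_normal(n) for n in widths[1:]]

x = rng.random((5, d))                      # points in the unit cube I^d
print(relu_net(x, weights, biases))         # continuous, piecewise linear in x
```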

In recent decades, the high-dimensional approximation of functions or signals depending on a large number d of variables has been of great interest, since it arises in a striking number of fields such as mathematical finance, chemistry, quantum mechanics, meteorology and, in particular, uncertainty quantification and deep machine learning. A numerical method for such problems may require a computational cost that increases exponentially in the dimension d as the accuracy increases. This phenomenon is called the curse of dimensionality, a term coined by Bellman (1957). Hence, for efficient computation in high-dimensional approximation, one of the key prerequisites is that the curse of dimensionality can be avoided, or at least eased to some extent. In some cases this can be achieved, particularly when the functions to be approximated have an appropriate mixed smoothness; see Bungartz and Griebel (2004), Novak and Woźniakowski (2008, 2010) and the references therein. With this restriction one can apply approximation methods and sampling algorithms constructed on hyperbolic crosses and sparse grids, which have a surprising effect: hyperbolic crosses and sparse grids contain far fewer elements than standard domains and grids, yet yield the same approximation error. This essentially reduces the computational cost and therefore makes the problem tractable.
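As a rough illustration of this sparsity effect (a schematic count of our own, not the exact grids of this paper, whose "notched" Smolyak grids appear later), the snippet below compares the number of hierarchical Faber-type basis functions on a full tensor grid of level m with the number on a Smolyak-type sparse grid constrained by |k|_1 ≤ m.

```python
# Schematic comparison of grid sizes, counting 2^{k_i} hierarchical basis
# functions per coordinate at level k_i: a full tensor grid uses all level
# vectors k with max_i k_i <= m, a Smolyak-type sparse grid only those with
# |k|_1 <= m.  The grids actually used in the paper differ ("notched"
# Smolyak grids), but the orders of magnitude are the point here.
from itertools import product

def full_grid_size(m: int, d: int) -> int:
    return sum(2 ** sum(k) for k in product(range(m + 1), repeat=d))

def sparse_grid_size(m: int, d: int) -> int:
    return sum(2 ** sum(k) for k in product(range(m + 1), repeat=d) if sum(k) <= m)

m = 8
for d in (2, 4, 6):
    print(f"d={d}: full grid {full_grid_size(m, d):.3e}, sparse grid {sparse_grid_size(m, d):.3e}")
```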

The approximation by deep ReLU neural networks of functions having a mixed smoothness is closely related to the high-dimensional sparse-grid approach, which was introduced by Zenger for the numerical solution of partial differential equations (PDEs). For functions of mixed smoothness of integer order, high-dimensional sparse-grid approximation and its applications were investigated by Bungartz and Griebel (2004), employing a multilevel basis of hierarchical Lagrange polynomials and measuring the approximation error in the norm of $L_2(\mathbb{I}^d)$ or the energy norm of $\mathring{W}^1_2(\mathbb{I}^d)$. In the paper Yserentant (2010) on the electronic Schrödinger equation with a very large number of variables, sparse-grid methods were used to approximate the eigenfunctions of the electronic Schrödinger operator, which have a certain mixed smoothness. Triebel (2015, Chapter 6) has indicated that when the initial data belong to spaces of mixed smoothness, the Navier–Stokes equations admit a unique solution having some mixed smoothness. There are far too many papers on sparse grids in various problems of high-dimensional approximation and in the numerical solution of PDEs, stochastic PDEs, etc., to mention them all here; the reader can consult the survey Bungartz and Griebel (2004) and the references therein.

Consider the problem of approximation of functions $f$ on $\mathbb{I}^d$ from a space $X$ of a particular smoothness by trigonometric polynomials (for periodic $f$) or dyadic B-splines with accuracy ɛ, with the error measured in the norm of the space $L_p(\mathbb{I}^d)$ or of the isotropic Sobolev space $W^\gamma_p(\mathbb{I}^d)$ ($1 \le p \le \infty$, $\gamma > 0$). If $X$ is the Hölder space of isotropic smoothness $\alpha$, then the computation complexity is typically estimated by $C(\alpha,d,p)\,\varepsilon^{-d/\alpha}$ or $C(\alpha,d,p)\,\varepsilon^{-d/(\alpha-\gamma)}$, respectively. For the Hölder space $X$ of mixed smoothness $\alpha$, the computation complexity is bounded by $C(\alpha,d,p)\,\varepsilon^{-1/\alpha}\log^{d-1}(\varepsilon^{-1})$ or $C(\alpha,d,p)\,\varepsilon^{-1/(\alpha-\gamma)}$, respectively, i.e., the bounds of the computation complexity are of quite different forms. Here, $C(\alpha,d,p)$ is a constant depending on $\alpha$, $d$, $p$ as well as on the norm in which the smoothness is defined. Similar estimates hold true for the computation complexity of approximation of functions in Sobolev or Besov type spaces of isotropic and mixed smoothness. Notice that only in the last case is the term in ɛ free from the dimension $d$. As usual, in classical settings of the approximation problem which do not take dimension-dependence into account, this constant is not of interest and its value is not specified.
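To get a numerical feel for the gap between the isotropic bound $\varepsilon^{-d/\alpha}$ and the mixed-smoothness bound $\varepsilon^{-1/\alpha}\log^{d-1}(\varepsilon^{-1})$, the short computation below evaluates both with the constants $C(\alpha,d,p)$ deliberately set to 1; quantifying the d-dependence of those constants is precisely the question studied in this paper.

```python
# Numeric sense of the two complexity regimes discussed above (constants
# C(alpha, d, p) are deliberately set to 1 here).
import math

alpha, eps = 2.0, 1e-2
for d in (2, 5, 10):
    isotropic = eps ** (-d / alpha)                              # ~ eps^{-d/alpha}
    mixed = eps ** (-1 / alpha) * math.log(1 / eps) ** (d - 1)   # ~ eps^{-1/alpha} log^{d-1}(1/eps)
    print(f"d={d:2d}  isotropic ~ {isotropic:.3e}   mixed ~ {mixed:.3e}")
```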

One of the central problems in high-dimensional approximation is to give an evaluation, explicit in $d$, of the term $C(\alpha,d,p)$ in the above-mentioned estimates of computation complexity, in order to understand the tractability of the approximation problem.

We briefly recall some known results on approximation by deep ReLU neural networks directly related to the present paper. In Yarotsky (2017a), the author constructed a deep ReLU neural network of depth $\mathcal{O}(\log(\varepsilon^{-1}))$ and size $\mathcal{O}(\varepsilon^{-d/r}\log(\varepsilon^{-1}))$ that is capable of $L_\infty(\mathbb{I}^d)$-approximating with accuracy ɛ functions on $\mathbb{I}^d$ from the unit ball of the isotropic Sobolev space $W^r_\infty(\mathbb{I}^d)$ of smoothness $r\in\mathbb{N}$. By using known results on the VC-dimension of deep ReLU neural networks, a lower bound was also established for this approximation. In Gühring et al. (2020), this result was extended to the isotropic Sobolev space $W^r_p(\mathbb{I}^d)$ with error measured in the norm of $W^s_p(\mathbb{I}^d)$ for $p\in[1,\infty]$ and $s<r$. Considering the $L_q(\mathbb{I}^d)$-approximation of functions from the unit ball of the Besov space $B^\alpha_{p,\theta}(\mathbb{I}^d)$ of mixed smoothness $\alpha>\max(0,1/p-1/q)$ by deep ReLU networks of depth $\mathcal{O}(\log N)$ and size $\mathcal{O}(N\log N)$, the author of Suzuki (2019) evaluated the approximation error as $\mathcal{O}(N^{-\alpha}\log^{\alpha(d-1)}N)$. The lower bound of the approximation error was estimated via known results on linear $N$-widths. In Montanelli and Du (2019), the authors constructed a deep ReLU neural network for $L_\infty(\mathbb{I}^d)$-approximation with accuracy ɛ of functions with homogeneous boundary condition in the Sobolev space $X^{p,2}\subset W^2_p(\mathbb{I}^d)$ ($p=2,\infty$) of mixed smoothness 2. Its depth and size are evaluated as $\mathcal{O}(\log(\varepsilon^{-1})\log d)$ and $\mathcal{O}\big(\varepsilon^{-1/2}\log^{\frac{3}{2}(d-1)+1}(\varepsilon^{-1})(d-1)\big)$, respectively. Notice that none of the hidden constants in these estimates of computation complexity and convergence rate were computed explicitly in the dimension $d$. In particular, in the proof of the convergence rate in Suzuki (2019), the author used a discrete quasi-norm equivalence for Besov spaces established in Dũng (2011a), which does not allow one to find such constants explicitly in $d$. Also, due to the homogeneous boundary condition of the functions from the unit ball of the spaces $X^{p,2}$ considered in Montanelli and Du (2019), their $d$-dimensional $L_\infty(\mathbb{I}^d)$-norm decreases as fast as $M_p^{-d}$ for some $M_p>1$ as $d$ goes to infinity; see Remark 4.6 for details.

The purpose of the present paper is to study the computation complexity of deep ReLU neural networks for the high-dimensional approximation of functions from the Hölder–Zygmund space $\mathring{H}^\alpha$ of mixed smoothness $\alpha$ satisfying the homogeneous boundary condition, when the dimension d may be very large. The approximation error is measured in the norm of the isotropic Sobolev space $\mathring{W}^1_p := \mathring{W}^1_p(\mathbb{I}^d)$. We focus our attention on the d-dependence of this computation complexity. For every function $f\in\mathring{H}^\alpha$, we want to explicitly construct a deep ReLU neural network having an output that approximates $f$ with a prescribed accuracy ɛ, and prove d-dependent bounds of the computation complexity of this approximation, characterized as its size and depth, explicitly in d and ɛ (cf. Anthony and Bartlett, 2009, Daubechies et al., 2019, Montanelli and Du, 2019, Yarotsky, 2017a).

Let us emphasize that this problem of approximating functions from the space $\mathring{H}^\alpha$ with error measured in the norm of the space $\mathring{W}^1_p$, in particular in the energy norm of the space $V := \mathring{W}^1_2$, naturally arises from high-dimensional approximation problems and numerical methods for PDEs, for instance for Poisson's equation; see Bungartz and Griebel (2004), the note of Bungartz and Griebel on the complexity of solving Poisson's equation, Garcke et al. (2001) and Griebel and Knapek (2009). For elliptic PDEs with homogeneous boundary condition, if the initial data and diffusion coefficients have a mixed smoothness, then the solution belongs to $\mathring{H}^\alpha$ with a certain $\alpha>0$. One can then consider the problem of approximation of this solution by deep ReLU neural networks with error measured in the energy norm of $V$. See a detailed example in Remark 4.5.

We briefly describe our contribution to high-dimensional approximation by deep ReLU neural networks. Denote by $\mathring{U}^\alpha$ the unit ball in the space $\mathring{H}^\alpha$. For every $f\in\mathring{U}^\alpha$, we explicitly construct a deep ReLU neural network $\Phi_f$ having an output $N(\Phi_f,\cdot)$ that approximates $f$ in the $\mathring{W}^1_p$-norm with a prescribed accuracy ɛ and has a computation complexity expressed by the dimension-dependent size $W(\Phi_f) \le C_1(\alpha,p)\, B^d\, (\varepsilon^{-1})^{\frac{1}{\alpha-1}} \log(\varepsilon^{-1})$, and the dimension-dependent depth $L(\Phi_f) \le C_2(\alpha)\, \log d \, \log(\varepsilon^{-1})$, where $B = B(d,\alpha,p) > 0$. Notice that the upper bounds for the size $W(\Phi_f)$ and the depth $L(\Phi_f)$ consist of three factors. The first factor is independent of the dimension d and the accuracy ɛ, the second depends only on the dimension d, and the third depends only on the accuracy ɛ. For the depth $L(\Phi_f)$ the second factor $\log d$ is very mild. Under a light restriction, in particular when $1\le p\le 2$, the second factor $B^d$ in the size $W(\Phi_f)$ satisfies $B>1$ when $d>d_0(\alpha,p)$.
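The following sketch only illustrates how these two upper bounds scale in d and ɛ; the values C1 = 1, C2 = 1 and B = 1.1 are placeholders of our own choosing (the paper determines the actual constants and the base B = B(d, α, p) explicitly).

```python
# Hedged numeric illustration of the stated upper bounds; C1, C2 and B below
# are placeholder values, chosen only to show how the three factors separate.
import math

def size_bound(eps, d, alpha, C1=1.0, B=1.1):
    # C1(alpha, p) * B^d * eps^{-1/(alpha-1)} * log(1/eps)
    return C1 * B ** d * (1 / eps) ** (1 / (alpha - 1)) * math.log(1 / eps)

def depth_bound(eps, d, alpha, C2=1.0):
    # C2(alpha) * log(d) * log(1/eps)
    return C2 * math.log(d) * math.log(1 / eps)

alpha, eps = 2.0, 1e-3
for d in (10, 100, 1000):
    print(f"d={d:5d}  size ~ {size_bound(eps, d, alpha):.3e}  depth ~ {depth_bound(eps, d, alpha):.1f}")
```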

By using a recent result on VC-dimension bounds for piecewise-linear neural networks in Bartlett, Harvey, Liaw, and Mehrabian (2019), we prove the following dimension-dependent lower bound for the case $p=\infty$. For a given $\varepsilon>0$, if $\mathcal{A}$ is a neural network architecture of depth $L(\mathcal{A}) \le C\log(\varepsilon^{-1})$ such that for every $f\in\mathring{U}^\alpha$ there is a deep ReLU neural network $\Phi_f$ of architecture $\mathcal{A}$ that approximates $f$ with accuracy ɛ, then there exists a constant $C_3(\alpha)>0$ such that $W(\mathcal{A}) \ge C_3(\alpha)\, 2^{-\frac{4d}{\alpha-1}}\, \varepsilon^{-\frac{1}{\alpha-1}}\, \big(\log(\varepsilon^{-1})\big)^{-2}$.

The proof of these results, in particular the construction of the approximating deep ReLU neural networks, relies on interpolation sampling recovery methods on sparse grids of points tailored to the Hölder–Zygmund mixed smoothness $\alpha$ and the regularity of the isotropic Sobolev space $\mathring{W}^1_p$. These sampling recovery methods are explicitly constructed as truncated Faber series of the functions to be approximated.
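For readers unfamiliar with Faber series, the following univariate sketch shows the hierarchical hat-function (Faber) decomposition that such constructions build on: the coefficients are, up to sign and a factor 1/2, second differences of f at dyadic points, and the truncated series is the piecewise-linear interpolant on the dyadic grid. This is a standard one-dimensional illustration under the homogeneous boundary condition f(0) = f(1) = 0; the multivariate, "notched" sparse-grid truncations actually used in the paper are not reproduced here.

```python
# One-dimensional sketch of the Faber (hierarchical hat-function) series.
# The standard coefficients lambda_{k,s}(f) = -1/2 * (second difference of f
# with step 2^{-k-1}) are assumed here.
import numpy as np

def hat(x, k, s):
    """Faber basis function v_{k,s}: a hat on [2^-k s, 2^-k (s+1)] with peak 1."""
    h = 2.0 ** (-k - 1)                    # half the support length
    c = 2.0 ** (-k) * s + h                # midpoint of the support
    return np.maximum(0.0, 1.0 - np.abs(x - c) / h)

def faber_coeff(f, k, s):
    """lambda_{k,s}(f): value at the midpoint minus the average of the endpoints."""
    a, b = 2.0 ** (-k) * s, 2.0 ** (-k) * (s + 1)
    return f((a + b) / 2) - 0.5 * (f(a) + f(b))

def faber_truncation(f, m, x):
    """Partial Faber series up to level m; it interpolates f at dyadic points of level m+1."""
    out = np.zeros_like(x)
    for k in range(m + 1):
        for s in range(2 ** k):
            out += faber_coeff(f, k, s) * hat(x, k, s)
    return out

f = lambda t: np.sin(np.pi * t)            # smooth test function vanishing on the boundary of I
x = np.linspace(0.0, 1.0, 1001)
for m in (2, 4, 6):
    err = np.max(np.abs(f(x) - faber_truncation(f, m, x)))
    print(f"level m={m}: max error {err:.2e}")
```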

Let us analyze some differences between the proofs in the present paper and those in the closely related paper Suzuki (2019), as well as in the other related papers (Gühring et al., 2020, Montanelli and Du, 2019, Yarotsky, 2017a) (see also Remark 3.2, Remark 3.3, Remark 4.3, Remark 4.4).

Firstly, to prove the results in Suzuki (2019), the author employed a discrete (quasi-)norm equivalence in terms of the coefficient functionals of a B-spline quasi-interpolation representation for the Besov space $B^\alpha_{p,\theta}(\mathbb{I}^d)$ (Dũng, 2011a). But, as mentioned above, this does not allow one to estimate the dimension-dependent component of the approximation error. In the present paper, by using the representation of functions by Faber series, we obtain dimension-dependent bounds for the size and depth of a deep ReLU neural network required for the approximation of functions from $\mathring{U}^\alpha$. This is one difference between the proofs in Suzuki (2019) and in the present paper.

Secondly, in both papers the functions to be approximated have a certain anisotropic mixed smoothness, but the norm in which the approximation error is measured in Suzuki (2019) (and also in Montanelli & Du, 2019) is that of the Lebesgue space $L_q(\mathbb{I}^d)$, while in our paper it is that of the isotropic Sobolev space $\mathring{W}^1_p(\mathbb{I}^d)$. The anisotropic mixed smoothness and the difference between the norms of $L_q(\mathbb{I}^d)$ and $\mathring{W}^1_p(\mathbb{I}^d)$ together lead to different methods of constructing (quasi-)interpolation sparse-grid sampling approximations and hence of deep ReLU neural network approximations (notice that these methods are similar if the functions to be approximated have an isotropic smoothness; see Gühring et al., 2020, Yarotsky, 2017a). In particular, the authors of Montanelli and Du (2019) and Suzuki (2019) used classical Smolyak grids, while in this paper we use "notched" Smolyak grids. Therefore, the sparsity of the grid points for interpolation sampling in our paper is much higher than in Montanelli and Du (2019) and Suzuki (2019).

The outline of this paper is as follows. In Section 2, we recall the necessary background on deep ReLU neural networks. Section 3 introduces the function spaces under consideration, presents a representation of continuous functions on the unit cube $\mathbb{I}^d$ by Faber series, and proves some error estimates for approximation by sparse-grid sampling recovery of functions in the Hölder–Zygmund classes $\mathring{U}^\alpha$. In Section 4, based on the results of Section 3, we construct a deep ReLU neural network that approximates functions in $\mathring{U}^\alpha$ in the norm of the space $\mathring{W}^1_p$, and prove upper and lower estimates for the size and depth required. Some concluding remarks are presented in Section 5.

Notation.  As usual, $\mathbb{N}$ denotes the natural numbers, $\mathbb{Z}$ the integers, $\mathbb{R}$ the real numbers, $\mathbb{N}_0 := \{s\in\mathbb{Z}: s\ge 0\}$ and $\mathbb{N}_{-1} := \mathbb{N}_0\cup\{-1\}$. The letter d is always reserved for the underlying dimension of $\mathbb{R}^d$, $\mathbb{N}^d$, etc., and $[d]$ denotes the set of all natural numbers from 1 to d. Vectorial quantities are denoted by boldface letters and $x_i$ denotes the $i$th coordinate of $\boldsymbol{x}\in\mathbb{R}^d$, i.e., $\boldsymbol{x} = (x_1,\ldots,x_d)$. For $\boldsymbol{x}\in\mathbb{R}^d$ and $\boldsymbol{k},\boldsymbol{s}\in\mathbb{N}_0^d$ we use the notation $2^{\boldsymbol{x}} := (2^{x_1},\ldots,2^{x_d})$ and $2^{-\boldsymbol{k}}\boldsymbol{s} := (2^{-k_1}s_1,\ldots,2^{-k_d}s_d)$. The symbol $|\Omega|$ stands for the cardinality of the finite set $\Omega$. For $\boldsymbol{x}\in\mathbb{R}^d$ we write $|\boldsymbol{x}|_0 := |\{j\in[d] : x_j\ne 0\}|$ and, if $0<p\le\infty$, we denote $|\boldsymbol{x}|_p := \big(\sum_{i=1}^d |x_i|^p\big)^{1/p}$ with the usual modification when $p=\infty$. The notations $|\cdot|_0$ and $|\cdot|_p$ are extended to matrices in $\mathbb{R}^{m\times n}$. For a function $f$ on $\mathbb{R}^d$, $\operatorname{supp}(f)$ denotes its support. The value $g(\infty)$ of a function $g$ of one variable is understood as $g(\infty) = \lim_{p\to\infty} g(p)$ when the limit exists.

Section snippets

Deep ReLU neural networks

There is a wide variety of deep neural network architectures, each adapted to specific tasks. For the approximation of functions from Hölder–Zygmund spaces, in this section we introduce feed-forward deep ReLU neural networks with one-dimensional output. We are interested in standard deep neural networks where only connections between neighboring layers are allowed. Let us introduce the necessary definitions and elementary facts on deep ReLU neural networks.

Definition 2.1

Let $d, L\in\mathbb{N}$ and $L\ge 2$.

  • A deep neural
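Since Definition 2.1 is only excerpted here, the sketch below records, as an assumption, the standard conventions for the two complexity measures discussed above: the depth L(Φ) of a feed-forward ReLU network is its number of affine layers and the size W(Φ) is its number of nonzero weights and biases. The paper's precise definition may differ in minor details.

```python
# Standard conventions assumed in this sketch (not copied from the paper):
# depth = number of affine layers, size = number of nonzero weights and biases.
import numpy as np

def network_depth(weights):
    return len(weights)                          # number of affine layers

def network_size(weights, biases):
    return sum(int(np.count_nonzero(W)) for W in weights) + \
           sum(int(np.count_nonzero(b)) for b in biases)

# toy architecture with widths (d, 8, 8, 1)
rng = np.random.default_rng(1)
widths = [3, 8, 8, 1]
weights = [rng.standard_normal((m, n)) for m, n in zip(widths[:-1], widths[1:])]
biases = [rng.standard_normal(n) for n in widths[1:]]
print("depth L =", network_depth(weights), " size W =", network_size(weights, biases))
```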

Faber series and high-dimensional sparse-grid sampling recovery

In this section we introduce the space $H^\alpha(\mathbb{I}^d)$ of functions having Hölder–Zygmund mixed smoothness $\alpha>0$ and the isotropic Sobolev space $\mathring{W}^1_p$, and recall a representation of continuous functions on $\mathbb{I}^d$ by tensor-product Faber series. This representation plays a fundamental role in the construction of sparse-grid sampling recovery and of deep neural networks for approximation in the $\mathring{W}^1_p$-norm of functions from the space $H^\alpha(\mathbb{I}^d)$. We explicitly construct linear sampling methods on sparse grids Rβ(m,) and

Approximation by deep ReLU neural networks

In this section, we will apply the results on sparse-grid sampling recovery from the previous section to the approximation by deep ReLU neural networks of functions from $\mathring{U}^\alpha$. For every ɛ>0 and every $f\in\mathring{U}^\alpha$, we will explicitly construct a deep ReLU neural network $\Phi_f$ having an architecture $\mathcal{A}_\varepsilon$ independent of $f$ and an output $N(\Phi_f,\cdot)$ which approximates $f$ in the norm of the isotropic Sobolev space $\mathring{W}^1_p$ with accuracy ɛ, and give dimension-dependent upper bounds for the size and the depth of $\Phi_f$. We

Concluding remarks

We have explicitly constructed a deep ReLU neural network $\Phi_f$ having an output that approximates, with an arbitrary prescribed accuracy ɛ in the norm of the isotropic Sobolev space $\mathring{W}^1_p$, functions $f\in\mathring{U}^\alpha$ having Hölder–Zygmund mixed smoothness $\alpha$ with $1<\alpha\le 2$. For this approximation, we have established dimension-dependent estimates for the computation complexity, characterized by the size $W(\Phi_f)$ and the depth $L(\Phi_f)$ of this deep ReLU neural network: $W(\Phi_f) \le C_1 B^d (\varepsilon^{-1})^{\frac{1}{\alpha-1}} \log(\varepsilon^{-1})$ and $L(\Phi_f) \le C_2 \log d \, \log(\varepsilon^{-1})$.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under Grant number 102.01-2020.03. A part of this work was done when the authors were working at the Vietnam Institute for Advanced Study in Mathematics (VIASM). They would like to thank the VIASM for providing a fruitful research environment and working condition.

References (54)

  • Bungartz, H.-J., & Griebel, M. A note on the complexity of solving Poisson's equation for spaces of bounded mixed...
  • Bungartz, H.-J., & Griebel, M. (2004). Sparse grids. Acta Numerica.
  • Ciarlet, P. (1978). The finite element method for elliptic problems.
  • Dũng, D. (2011). Optimal adaptive sampling recovery. Advances in Computational Mathematics.
  • Dũng, D. (2016). Sampling and cubature on sparse grids based on a B-spline quasi-interpolation. Foundations of Computational Mathematics.
  • Dũng, D., et al. (2021). Computation complexity of deep ReLU neural networks in high-dimensional approximation.
  • Dũng, D., et al.
  • Dũng, D., et al. (2020). Dimension-dependent error estimates for sampling recovery on Smolyak grids based on B-spline quasi-interpolation. Journal of Approximation Theory.
  • Dũng, D., et al. (2013). N-widths and ɛ-dimensions for high-dimensional approximations. Foundations of Computational Mathematics.
  • Daubechies, I., et al. (2019). Nonlinear approximation and (deep) ReLU networks.
  • DeVore, R., et al. (1993). Constructive approximation.
  • E, W., et al. (2018). Exponential convergence of the deep neural network approximation for analytic functions. Science China Mathematics.
  • Garcke, J., et al. (2001). Data mining with sparse grids. Computing.
  • Gribonval, R., et al. (2021). Approximation spaces of deep neural networks.
  • Griebel, M., et al. (2009). Optimized general sparse grid approximation spaces for operator equations. Mathematics of Computation.
  • Grohs, P., et al. (2019). Deep neural network approximation theory.
  • Gühring, I., et al. (2020). Error bounds for approximations with deep ReLU neural networks in W^{s,p} norms. Analysis and Applications (Singapore).