Approximation rates for neural networks with general activation functions
Introduction
Deep neural networks have recently revolutionized a variety of areas of machine learning, including computer vision and speech recognition (LeCun, Bengio, & Hinton, 2015). A deep neural network with $\ell$ layers is a statistical model of the form $f(x) = A_\ell(\sigma(A_{\ell-1}(\cdots\sigma(A_1(x))\cdots)))$, where each $A_i$ is an affine linear function, $\sigma$ is a fixed activation function which is applied pointwise, and the coefficients of $A_1,\dots,A_\ell$ are the parameters of the model.
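As a concrete illustration of this composition, the following NumPy sketch evaluates such a model with ReLU as one particular choice of $\sigma$ (the paper treats general activation functions). The layer sizes and random weights are arbitrary illustrative choices, not anything prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # One concrete choice of the activation function sigma, applied pointwise.
    return np.maximum(x, 0.0)

def deep_network(x, affine_maps, sigma=relu):
    """Evaluate f(x) = A_L(sigma(A_{L-1}(...sigma(A_1(x))...))),
    where each A_i(z) = W_i z + b_i is affine linear."""
    z = x
    for i, (W, b) in enumerate(affine_maps):
        z = W @ z + b
        if i < len(affine_maps) - 1:  # no activation after the final affine map
            z = sigma(z)
    return z

# A small 3-layer network mapping R^4 -> R; the pairs (W_i, b_i) are the parameters.
sizes = [4, 8, 8, 1]
params = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y = deep_network(rng.standard_normal(4), params)
```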
The approximation properties of neural networks have received a lot of attention, with many positive results. For example, in Ellacott (1994) and Leshno, Lin, Pinkus, and Schocken (1993) it is shown that neural networks can approximate any continuous function on a compact set as long as the activation function $\sigma$ is not a polynomial, i.e. that the set $\Sigma = \mathrm{span}\{\sigma(\omega\cdot x + b) : \omega\in\mathbb{R}^d,\, b\in\mathbb{R}\}$ is dense in $C(K)$ for any compact $K\subset\mathbb{R}^d$. An earlier result of this form can be found in Hornik (1991, 1993), which shows that derivatives can be approximated arbitrarily accurately as well. An elementary and constructive proof can be found in Attali and Pagès (1997).
In addition, quantitative estimates on the order of approximation are obtained for sigmoidal activation functions in Barron (1993) and for periodic activation functions in Mhaskar and Micchelli (1994, 1995). Results for general activation functions can be found in Hornik, Stinchcombe, White, and Auer (1994). A remarkable feature of these results is that the approximation rate is $O(n^{-\frac{1}{2}})$, where $n$ is the number of hidden neurons, which shows that neural networks can overcome the curse of dimensionality. Results concerning the approximation properties of generalized translation networks (a generalization of two-layer neural networks) for smooth and analytic functions are obtained in Mhaskar (1996). Approximation estimates for multilayer convolutional neural networks are considered in Zhou (2018), and for multilayer networks with rectified linear activation functions in Yarotsky (2017). A comparison of the effect of depth vs. width on the expressive power of neural networks is presented in Lu, Pu, Wang, Hu, and Wang (2017).
An optimal approximation rate in terms of high-order Sobolev norms is given in Petrushev (1998). That work differs from both the previous work and the current work in that it considers the approximation of highly smooth functions, for which proof techniques based on the Hilbert space structure of Sobolev spaces can be used. In contrast, the line of reasoning initially pursued in Barron (1993) and continued in this work makes significantly weaker assumptions on the function to be approximated.
A review of a variety of known results, especially for networks with one hidden layer, can be found in Pinkus (1999). More recently, these results have been improved by a factor of $n^{-\frac{1}{d}}$ in Klusowski and Barron (2016) using the idea of stratified sampling, based in part on the techniques in Makovoz (1996).
Our work, like much of the previous work, focuses on the case of two-layer neural networks. A two-layer neural network can be written in the particularly simple form $f_n(x) = \sum_{i=1}^{n} a_i\,\sigma(\omega_i\cdot x + b_i)$, where $a_i, b_i\in\mathbb{R}$ and $\omega_i\in\mathbb{R}^d$ are parameters and $n$ is the number of hidden neurons in the model.
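A minimal NumPy sketch of this parametrization, with $\tanh$ standing in for a generic activation function; the names and sizes here are illustrative assumptions.

```python
import numpy as np

def two_layer_network(x, a, omega, b, sigma=np.tanh):
    """f_n(x) = sum_{i=1}^n a_i * sigma(omega_i . x + b_i).
    x: shape (d,); omega: shape (n, d); a, b: shape (n,)."""
    return a @ sigma(omega @ x + b)

rng = np.random.default_rng(1)
n, d = 16, 3  # n hidden neurons in dimension d
a, b = rng.standard_normal(n), rng.standard_normal(n)
omega = rng.standard_normal((n, d))
value = two_layer_network(rng.standard_normal(d), a, omega, b)
```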
In this work, we study how the approximation properties of two-layer neural networks depend on the number of hidden neurons. In particular, we consider the class $\Sigma_n$ of functions representable with at most $n$ hidden neurons, and prove Theorem 2, concerning the order of approximation as $n\to\infty$ for activation functions with polynomial decay, and Theorem 3, which applies to neural networks with periodic activation functions. Our results assume that the function to be approximated, $f$, has bounded Barron norm, and we consider the problem of approximating $f$ on a bounded domain $\Omega$. This is a significantly weaker assumption than the strong smoothness assumptions made in Kainen, Kurkova, and Vogt (2007) and Petrushev (1998). Similar results appear in Barron (1993) and Hornik et al. (1994), but we have improved their bound by a logarithmic factor for exponentially decaying activation functions and generalized these results to polynomially decaying activation functions. We also leverage this result to obtain a somewhat worse, though still dimension-independent, approximation rate without the polynomial decay condition. This result ultimately applies to every activation function of bounded variation. Finally, we extend the stratified sampling argument in Klusowski and Barron (2016) to polynomially decaying activation functions in Theorem 5 and to periodic activation functions in Theorem 6. This gives an improvement on the asymptotic rate of convergence under mild additional assumptions.
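The $O(n^{-\frac{1}{2}})$ rates discussed above trace back to a Monte Carlo sampling argument: a function with an integral representation over neurons is approximated by averaging $n$ random samples of the integrand. The following toy experiment is our own illustration of that mechanism, not the paper's construction: the target is itself an average over a large pool of random $\tanh$ neurons, we subsample $n$ of them, and the measured $L^2$ error shrinks roughly like $n^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(42)
d, N = 2, 20_000                     # input dimension; size of the "teacher" pool
W = rng.standard_normal((N, d))      # pool directions omega_j
B = rng.standard_normal(N)           # pool biases b_j
X = rng.standard_normal((200, d))    # evaluation points

features = np.tanh(X @ W.T + B)      # sigma(omega_j . x + b_j), shape (200, N)
target = features.mean(axis=1)       # f(x) = average over the full pool

def subsample_error(n, trials=20):
    # Approximate f by averaging n neurons drawn i.i.d. from the pool,
    # as in the Monte Carlo argument; report the mean RMS error over trials.
    errs = []
    for _ in range(trials):
        idx = rng.integers(0, N, size=n)
        approx = features[:, idx].mean(axis=1)
        errs.append(np.sqrt(np.mean((approx - target) ** 2)))
    return float(np.mean(errs))

errors = {n: subsample_error(n) for n in (16, 64, 256)}
# Quadrupling n roughly halves the error, consistent with the n^{-1/2} rate.
```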
The paper is organized as follows. In the second section, we discuss some basic results concerning the Fourier transform. We use these results to provide a simplified Fourier-analytic proof of the density result in Leshno et al. (1993), under the mild additional assumption of polynomial growth on $\sigma$. Then, in the third section, we study the order of approximation and prove Theorems 2 and 3, extending the results in Barron (1993) and Hornik et al. (1994) to polynomially decaying and periodic activation functions, respectively, and removing a logarithmic factor in the rate of approximation. In the fourth section, we provide a new argument using an approximate integral representation to obtain dimension-independent results without the polynomial decay condition in Theorem 4. In the fifth section, we use a stratified sampling argument to prove Theorems 5 and 6, which improve upon the convergence rates in Theorems 2 and 3 under mild additional assumptions. This generalizes the results in Klusowski and Barron (2016) to more general activation functions. Finally, we give concluding remarks and further research directions in the conclusion.
Section snippets
Preliminaries
Our arguments will make use of the theory of tempered distributions (see Stein and Weiss (2016) and Strichartz (2003) for an introduction), and we begin by collecting some results of independent interest, which will also be important later. First, we note that an activation function $\sigma$ which satisfies a polynomial growth condition $|\sigma(x)| \le C(1+|x|)^k$ for some constants $C$ and $k$ is a tempered distribution. As a result, we make this assumption on our activation functions in the following theorems. …
Convergence rates in Sobolev norms
In this section, we study the order of approximation for two-layer neural networks as the number of neurons increases. In particular, we consider the space $\Sigma_n$ of functions represented by a two-layer neural network with $n$ neurons and activation function $\sigma$ given in (4), and ask the following question: given a function $f$ on a bounded domain, how many neurons do we need to approximate $f$ with a given accuracy?
Specifically, we will consider the problem of approximating a function $f$ with bounded Barron …
Activation functions without decay
In this section, we show that one can remove the decay condition on $\sigma$, at the cost of a slightly worse (though still dimension-independent) approximation rate. The main new tool here is the use of an approximate integral representation, followed by an optimization over the accuracy of the representation. Finally, as a corollary, we obtain a dimension-independent approximation rate for all bounded, integrable activation functions and all activation functions of bounded variation.
Theorem 4
An improved estimate using stratified sampling
In this section, we show how the argument in the proof of Theorem 2 can be improved using a stratified sampling method. Our argument is based on the method presented in Klusowski and Barron (2016) and Makovoz (1996). This method allows us to obtain an improved asymptotic convergence rate under additional smoothness assumptions on both the activation function and the function to be approximated.
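To convey the idea behind the improvement, here is stratified sampling in its simplest setting, estimating a one-dimensional integral; this is a deliberately simplified stand-in for the paper's argument, which stratifies the sampling of neurons rather than of an interval. Drawing one point per subinterval instead of $n$ i.i.d. points sharply reduces the Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(7)
f = lambda x: np.sin(2 * np.pi * x)  # integrand on [0, 1]; true integral is 0
n, trials = 64, 500

def plain_mc():
    # Plain Monte Carlo: n i.i.d. uniform samples; error is O(n^{-1/2}).
    return f(rng.random(n)).mean()

def stratified_mc():
    # Stratified sampling: one uniform sample in each subinterval [i/n, (i+1)/n);
    # the variance within each stratum is much smaller than the global variance.
    u = (np.arange(n) + rng.random(n)) / n
    return f(u).mean()

rmse_plain = np.sqrt(np.mean([plain_mc() ** 2 for _ in range(trials)]))
rmse_strat = np.sqrt(np.mean([stratified_mc() ** 2 for _ in range(trials)]))
```

For a smooth integrand the stratified estimator's error decays like $n^{-3/2}$ rather than $n^{-1/2}$, which is the kind of gain the stratified argument extracts under additional smoothness assumptions.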
Theorem 5 Let $\Omega \subset \mathbb{R}^d$ be a bounded domain and … . If the activation function $\sigma$ is non-zero and it …
Conclusion
We have provided a few new results in the theory of approximation by neural networks. These results improve upon existing bounds on the rate of approximation achieved by two-layer neural networks as the number of neurons $n$ increases, by extending them to more general activation functions. In particular, we obtain a dimension-independent rate of $O(n^{-\frac{1}{2}})$ for polynomially decaying activation functions, and a somewhat worse, though still dimension-independent, rate for more general bounded activation functions. We also show how a stratified …
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
We would like to thank Prof. Jason Klusowski and Dr. Juncai He for insightful discussions and helpful comments. This work was supported by a Penn State Institute for Cyber Science Seed Grant, USA, the Verne M. Willaman Fund, and the National Science Foundation (Grant No. DMS-1819157).
References (25)
- Attali, J.-G., & Pagès, G. (1997). Approximations of functions by a multilayer perceptron: a new approach. Neural Networks.
- Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks.
- Hornik, K. (1993). Some new results on neural network approximation. Neural Networks.
- Kainen, P. C., Kurkova, V., & Vogt, A. (2007). A Sobolev-type upper bound for rates of approximation by linear combinations of Heaviside plane waves. Journal of Approximation Theory.
- Leshno, M., Lin, V. Ya., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks.
- Makovoz, Y. (1996). Random approximants and neural networks. Journal of Approximation Theory.
- Mhaskar, H. N., & Micchelli, C. A. (1995). Degree of approximation by neural and translation networks with a single hidden layer. Advances in Applied Mathematics.
- Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks. Neural Networks.
- Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory.
- Ellacott, S. W. (1994). Aspects of the numerical analysis of neural networks. Acta Numerica.