Neural Networks, Volume 161, April 2023, Pages 242–253

Simultaneous approximation of a smooth function and its derivatives by deep neural networks with piecewise-polynomial activations

https://doi.org/10.1016/j.neunet.2023.01.035

Highlights

  • Rates and complexity for smooth function approximation in Hölder norms by ReQU neural networks.

  • Explicit and uniform bounds for weights of the approximating neural network.

  • Exponential convergence rates for analytic functions.

Abstract

This paper investigates the approximation properties of deep neural networks with piecewise-polynomial activation functions. We derive the required depth, width, and sparsity of a deep neural network to approximate any Hölder smooth function up to a given approximation error in Hölder norms in such a way that all weights of this neural network are bounded by 1. The latter feature is essential to control generalization errors in many statistical and machine learning applications.

Introduction

Neural networks have recently gained much attention due to their impressive performance in many complicated practical tasks, including image processing (LeCun, Bengio, & Hinton, 2015), generative modeling (Goodfellow et al., 2014), reinforcement learning (Mnih et al., 2015), the numerical solution of PDEs (e.g., Geist et al., 2021, Han et al., 2018), and optimal control (Chen et al., 2019, Onken et al., 2022). This makes them extremely useful in the design of self-driving vehicles (Li, Ota, & Dong, 2018) and robot control systems (e.g., Bozek et al., 2020, Cembrano et al., 1994, González-Álvarez et al., 2022). One of the reasons for this success of neural networks is their expressiveness, that is, the ability to approximate functions with any desired accuracy. The question of the expressiveness of neural networks has a long history and goes back to the papers (Cybenko, 1989, Funahashi, 1989, Hornik, 1991). In particular, Cybenko (1989) showed that one hidden layer is enough to approximate any continuous function $f$ with any prescribed accuracy $\varepsilon > 0$. However, further analysis revealed that deep neural networks may require far fewer parameters than shallow ones to approximate $f$ with the same accuracy. Much effort has been devoted in recent years to understanding the fidelity of deep neural networks. In a pioneering work, Yarotsky (2017) showed that for any target function $f$ from the Sobolev space $W^{n,\infty}([0,1]^d)$ there is a neural network with $O(\varepsilon^{-d/n})$ parameters and ReLU activation function that approximates $f$ within accuracy $\varepsilon$ with respect to the $L_\infty$-norm on the unit cube $[0,1]^d$. Further works in this direction considered various smoothness classes of the target functions (Gühring and Raslan, 2021, Li et al., 2020a, Lu et al., 2021, Shen et al., 2021), neural networks with diverse activations (De Ryck et al., 2021, Gühring and Raslan, 2021, Jiao et al., 2021, Langer, 2021), domains of more complicated shape (Shen, Yang, & Zhang, 2020), and measured the approximation errors with respect to different norms (De Ryck et al., 2021, Gühring and Raslan, 2021, Schmidt-Hieber, 2020, Yarotsky, 2017). Several authors also considered the expressiveness of neural networks with different architectures. This includes wide neural networks of logarithmic (Gühring and Raslan, 2021, Schmidt-Hieber, 2020, Yarotsky, 2017) or even constant depth (De Ryck et al., 2021, Li et al., 2020a, Li et al., 2020b, Shen et al., 2021), as well as deep and narrow networks (Hanin, 2019, Kidger and Lyons, 2020, Park et al., 2021). Most of the existing results on the expressiveness of neural networks measure the quality of approximation with respect to either the $L_\infty$- or $L_p$-norm, $p \ge 1$. Much fewer papers study the approximation of derivatives of smooth functions; these rare exceptions include Gühring et al., 2020, Gühring and Raslan, 2021 and De Ryck et al. (2021).

In the present paper, we focus on feed-forward neural networks with piecewise-polynomial activation functions of the form $\sigma_{\mathrm{ReQU}}(x) = (\max\{x, 0\})^2$. Neural networks with such activations are known to successfully approximate smooth functions from the Sobolev and Besov spaces with respect to the $L_\infty$- and $L_p$-norms (see, for instance, Abdeljawad and Grohs, 2022, Ali and Nouy, 2021, Chen et al., 2022, Gribonval et al., 2022, Klusowski and Barron, 2018, Li et al., 2020a, Li et al., 2020b, Siegel and Xu, 2022). We continue this line of research and study the ability of such neural networks to approximate not only smooth functions themselves but also their derivatives. We derive non-asymptotic upper bounds on the Hölder norm of the difference between the target function and its approximation from a class of sparsely connected neural networks with ReQU activations. In particular, we show that for any $f$ from a Hölder ball $H^\beta([0,1]^d, H)$, $H > 0$, $\beta > 2$ (see Section 2 for the definition), and any $\varepsilon > 0$, there exists a neural network with ReQU activation functions that uniformly approximates the target function in the norms of the Hölder spaces $H^\ell([0,1]^d)$ for all $\ell \in \{0, \ldots, \lfloor\beta\rfloor\}$. Here and further in the paper, $\lfloor\beta\rfloor$ stands for the largest integer which is strictly smaller than $\beta$. A simplified statement of our main result is given below.
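Before stating it, for concreteness, here is a minimal sketch of a fully connected network with ReQU activations, assuming PyTorch; the names ReQU and requ_mlp are illustrative, and the architecture shown is a generic one rather than the specific construction analyzed in the paper.

    import torch
    import torch.nn as nn

    class ReQU(nn.Module):
        """Rectified quadratic unit: sigma(x) = (max(x, 0))**2."""
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.clamp(x, min=0.0) ** 2

    def requ_mlp(in_dim: int, out_dim: int, width: int, depth: int) -> nn.Sequential:
        """Fully connected network with `depth` hidden layers of width `width`,
        each followed by a ReQU activation (illustrative architecture)."""
        layers = [nn.Linear(in_dim, width), ReQU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(width, width), ReQU()]
        layers.append(nn.Linear(width, out_dim))
        return nn.Sequential(*layers)

    # Example: a ReQU network mapping [0,1]^d -> R^p with d = 2, p = 1.
    net = requ_mlp(in_dim=2, out_dim=1, width=16, depth=3)
    x = torch.rand(8, 2)       # a batch of points in the unit square
    print(net(x).shape)        # torch.Size([8, 1])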

Theorem 1 (simplified version of Theorem 2)

Fix $\beta > 2$ and $p, d \in \mathbb{N}$. Then, for any $H > 0$, any $f: [0,1]^d \to \mathbb{R}^p$ with $f \in H^\beta([0,1]^d, H)$, and any integer $K \ge 2$, there exists a neural network $h_f: [0,1]^d \to \mathbb{R}^p$ with ReQU activation functions such that it has $O(\log d + \beta + \log\log H)$ layers, at most $O\big(p\,d\,(K + \lfloor\beta\rfloor)^d\big)$ neurons in each layer, and $O\big(p\,(d\lfloor\beta\rfloor + d^2 + \log\log H)(K + \lfloor\beta\rfloor)^d\big)$ non-zero weights taking their values in $[-1, 1]$. Moreover, it holds that
$$\|f - h_f\|_{H^\ell([0,1]^d)} \le C^{\beta d}\, H\, \beta^{\ell}\, K^{\ell - \beta} \quad \text{for all } \ell \in \{0, \ldots, \lfloor\beta\rfloor\},$$
where $C$ is a universal constant.
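The rates in Theorem 1 combine into the error-versus-size trade-off highlighted in the contributions below: with roughly $N \asymp (K + \lfloor\beta\rfloor)^d$ non-zero weights (constants ignored), an $H^\ell$-error of order $K^{\ell - \beta}$ behaves like $N^{-(\beta - \ell)/d}$. The following back-of-the-envelope check of this exponent uses illustrative values of $\beta$, $d$, $\ell$ and drops all constants; it is not part of the paper's analysis.

    import math

    beta, d = 3.5, 2
    fb = 3        # floor(beta): the largest integer strictly smaller than beta
    l = 1         # order of the Hölder norm in which the error is measured

    prev = None
    for K in (8, 32, 128, 512):
        n_weights = (K + fb) ** d     # number of non-zero weights, up to constants
        error = K ** (l - beta)       # H^l approximation error, up to constants
        if prev is not None:
            slope = math.log(error / prev[1]) / math.log(n_weights / prev[0])
            print(f"K = {K:4d}: empirical exponent ~ {slope:+.3f}")
        prev = (n_weights, error)

    print("predicted exponent -(beta - l)/d =", -(beta - l) / d)   # -1.25 here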

We provide explicit expressions for the hidden constants in Theorem 2. The main contributions of our work can be summarized as follows.

  • Given a smooth target function $f \in H^\beta([0,1]^d, H)$, we construct a neural network that simultaneously approximates all the derivatives of $f$ up to order $\lfloor\beta\rfloor$ with optimal dependence of the precision on the number of non-zero weights. That is, if we denote the number of non-zero weights in the network by $N$, then it holds that $\|f - h_f\|_{H^\ell([0,1]^d)} = O\big(N^{-(\beta - \ell)/d}\big)$ simultaneously for all $\ell \in \{0, \ldots, \lfloor\beta\rfloor\}$.

  • The constructed neural network has almost the same smoothness as the target function itself while approximating it with the optimal accuracy. This property turns out to be very useful in many applications including the approximation of PDEs and density transformations where we need to use derivatives of the approximation.

  • The weights of the approximating neural network are bounded in absolute value by 1. The latter property plays a crucial role in deriving bounds on the generalization error of empirical risk minimizers in terms of the covering number of the corresponding parametric class of neural networks. Note that the upper bounds on the weights provided in De Ryck et al., 2021, Gühring et al., 2020, Gühring and Raslan, 2021 blow up as the approximation error decreases.

The rest of the paper is organized as follows. In Section 2, we introduce the necessary definitions and notation. Section 3 contains the statement of our main result, Theorem 2, followed by a detailed comparison with the existing literature. We then present numerical experiments in Section 4. The proofs are collected in Section 5. Some auxiliary facts are deferred to the Appendix.

Section snippets

Norms.

For a matrix $A$ and a vector $v$, we denote by $\|A\|_\infty$ and $\|v\|_\infty$ the maximal absolute value of the entries of $A$ and $v$, respectively. $\|A\|_0$ and $\|v\|_0$ shall stand for the number of non-zero entries of $A$ and $v$, respectively. Finally, the Frobenius norm and the operator norm of $A$ are denoted by $\|A\|_F$ and $\|A\|$, respectively, and the Euclidean norm of $v$ is denoted by $\|v\|$. For a function $f: \Omega \to \mathbb{R}^d$, we set
$$\|f\|_{L_\infty(\Omega)} = \operatorname*{ess\,sup}_{x \in \Omega} \|f(x)\|, \qquad \|f\|_{L_p(\Omega)} = \left(\int_\Omega \|f(x)\|^p \, dx\right)^{1/p}, \quad p \ge 1.$$
If the domain $\Omega$ is clear from the context, we simply write $L_\infty$ or $L_p$.
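A small numpy illustration of these norms; the function norms are approximated on a finite grid, which is a simplification made here for illustration only.

    import numpy as np

    A = np.array([[1.0, -3.0, 0.0],
                  [0.5,  0.0, 2.0]])
    v = np.array([0.0, -4.0, 1.5])

    max_abs_A = np.max(np.abs(A))           # maximal absolute value of the entries of A
    nnz_A = np.count_nonzero(A)             # number of non-zero entries of A
    fro_A = np.linalg.norm(A, ord="fro")    # Frobenius norm of A
    op_A = np.linalg.norm(A, ord=2)         # operator (spectral) norm of A
    eucl_v = np.linalg.norm(v)              # Euclidean norm of v

    # L_infty and L_p norms of a scalar test function g on Omega = [0,1]^2,
    # approximated on a uniform grid (the unit square has volume 1, so the
    # mean over the grid approximates the integral).
    p = 2
    xs = np.linspace(0.0, 1.0, 201)
    X1, X2 = np.meshgrid(xs, xs, indexing="ij")
    G = np.sin(X1) * np.cos(X2)                         # an arbitrary test function
    sup_norm = np.max(np.abs(G))                        # ~ ||g||_{L_infty([0,1]^2)}
    lp_norm = np.mean(np.abs(G) ** p) ** (1.0 / p)      # ~ ||g||_{L_p([0,1]^2)}
    print(max_abs_A, nnz_A, fro_A, op_A, eucl_v, sup_norm, lp_norm)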

Approximation of functions from Hölder classes

Our main result states that any function from $H^\beta([0,1]^d, H)$, $H > 0$, $\beta > 2$, can be approximated by a feed-forward deep neural network with ReQU activation functions in $H^\ell([0,1]^d)$, $\ell \in \{0, \ldots, \lfloor\beta\rfloor\}$.

Theorem 2

Let $\beta > 2$ and let $p, d \in \mathbb{N}$. Then, for any $H > 0$, any $f: [0,1]^d \to \mathbb{R}^p$ with $f \in H^\beta([0,1]^d, H)$, and any integer $K \ge 2$, there exists a neural network $h_f: [0,1]^d \to \mathbb{R}^p$ of width
$$\Big(4d(K + \lfloor\beta\rfloor)^d \vee \big(12(K + 2\lfloor\beta\rfloor) + 1\big)\Big)\, p$$
with
$$6 + 2(\lfloor\beta\rfloor - 2) + \lceil\log_2 d\rceil + 2\lceil\log_2(2d\lfloor\beta\rfloor + d)\rceil \vee \lceil\log_2\log_2 H\rceil \vee 1$$
hidden layers and at most $p\,(K + \lfloor\beta\rfloor)^d\, C(\beta, d, H)$ non-zero weights taking their values in $[-1, 1]$.

Numerical experiments

In this section, we provide numerical experiments to illustrate the approximation properties of neural networks with ReQU activations. We considered a scalar function $f(x) = \sin(x_1^2 x_2)$ of two variables and approximated it on the unit square $[0,1]^2$ via neural networks with two types of activations: ReLU and ReQU. All the neural networks were fully connected, and all their hidden layers had width 16. The first layer had width 2. The depth of the neural networks took its values in $\{1, 2, 3, 4, 5\}$.
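A compact sketch of such an experiment is given below, assuming PyTorch. The training protocol (sample sizes, optimizer, number of steps) is an illustrative choice rather than the exact protocol used in the paper, and the target is read as f(x) = sin(x_1^2 x_2) from the formula above.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    class ReQU(nn.Module):
        def forward(self, x):               # sigma(x) = (max(x, 0))**2
            return torch.relu(x) ** 2

    def make_net(depth: int, act: nn.Module) -> nn.Sequential:
        # Fully connected: input width 2, hidden width 16, scalar output.
        layers = [nn.Linear(2, 16), act]
        for _ in range(depth - 1):
            layers += [nn.Linear(16, 16), act]
        layers.append(nn.Linear(16, 1))
        return nn.Sequential(*layers)

    def f(x: torch.Tensor) -> torch.Tensor:
        return torch.sin(x[:, :1] ** 2 * x[:, 1:])

    x_train, x_test = torch.rand(4096, 2), torch.rand(2048, 2)
    y_train = f(x_train)

    for depth in (1, 2, 3, 4, 5):
        for name, act in (("ReLU", nn.ReLU()), ("ReQU", ReQU())):
            net = make_net(depth, act)
            opt = torch.optim.Adam(net.parameters(), lr=1e-3)
            for _ in range(1000):           # illustrative training budget
                opt.zero_grad()
                loss = ((net(x_train) - y_train) ** 2).mean()
                loss.backward()
                opt.step()
            err = (net(x_test) - f(x_test)).abs().max().item()
            print(f"depth={depth} {name}: sup-error ~ {err:.3e}")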

Proof of Theorem 2

Step 1. Let $f = (f_1, \ldots, f_p)$. Consider a vector $a = (a_1, \ldots, a_{2\lfloor\beta\rfloor + K + 1})$ such that
$$a_1 = \cdots = a_{\lfloor\beta\rfloor + 1} = 0; \quad a_{\lfloor\beta\rfloor + 1 + j} = j/K, \ 1 \le j \le K - 1; \quad a_{\lfloor\beta\rfloor + K + 1} = \cdots = a_{2\lfloor\beta\rfloor + K + 1} = 1.$$
By Theorem 3, there exist tensor-product splines $S_f^{\beta,K} = (S_{f,1}^{\beta,K}, \ldots, S_{f,p}^{\beta,K})$ of order $\lfloor\beta\rfloor \ge 2$ associated with knots at $\{(a_{j_1}, a_{j_2}, \ldots, a_{j_d}) : j_1, \ldots, j_d \in \{1, \ldots, 2\lfloor\beta\rfloor + K + 1\}\}$ such that
$$\|f - S_f^{\beta,K}\|_{H^\ell([0,1]^d)} \le \max_{m \in \{1, \ldots, p\}} \|f_m - S_{f,m}^{\beta,K}\|_{H^\ell([0,1]^d)} \le (2ed)^{\beta} H K^{\ell - \beta} + 9^d (\lfloor\beta\rfloor - 1)(2\lfloor\beta\rfloor + 1)^{2d + \ell} (2ed)^{\beta} H K^{\ell - \beta}.$$
Our goal is to show that $S_f^{\beta,K}$ can be represented by a neural network $h_f$ with ReQU activations.
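To make Step 1 concrete, the snippet below builds the knot vector a and a univariate B-spline on it, using numpy and scipy purely for illustration. The degree convention (degree ⌊β⌋, hence ⌊β⌋+1 repeated boundary knots) is an assumption consistent with the knot multiplicities above, and the coefficients are placeholders rather than the spline S_f^{β,K} from Theorem 3.

    import numpy as np
    from scipy.interpolate import BSpline

    beta_floor, K = 3, 8      # floor(beta) and the number of grid cells, chosen for illustration

    # Knot vector a_1, ..., a_{2*floor(beta) + K + 1}: floor(beta)+1 zeros,
    # interior knots j/K for j = 1, ..., K-1, and floor(beta)+1 ones.
    a = np.concatenate([
        np.zeros(beta_floor + 1),
        np.arange(1, K) / K,
        np.ones(beta_floor + 1),
    ])
    assert a.size == 2 * beta_floor + K + 1

    # One univariate B-spline of degree floor(beta) on these knots; tensor products
    # of such univariate splines over the d coordinates yield the tensor-product
    # splines used in the proof.
    n_basis = a.size - beta_floor - 1     # number of B-spline basis functions
    coef = np.zeros(n_basis)
    coef[n_basis // 2] = 1.0              # select a single basis function as an example
    spline = BSpline(a, coef, beta_floor)
    print(spline(np.linspace(0.0, 1.0, 5)))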

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The publication was supported by the grant for research centers in the field of AI provided by the Analytical Center for the Government of the Russian Federation (ACRF) in accordance with the agreement on the provision of subsidies (identifier of the agreement 000000D730321P5Q0002) and the agreement with HSE University, Russia No. 70-2021-00139. Nikita Puchkin is a Young Russian Mathematics award winner and would like to thank its sponsors and jury. Denis Belomestny acknowledges the financial

References (41)

  • Abdeljawad, A., et al. (2022). Approximations with deep neural networks in Sobolev time-space. Analysis and Applications.

  • Ali, M., et al. (2021). Approximation of smoothness classes by deep rectifier networks. SIAM Journal on Numerical Analysis.

  • Bozek, P., et al. (2020). Neural network control of a wheeled mobile robot based on optimal trajectories. International Journal of Advanced Robotic Systems.

  • Candès, E., et al. (2007). Fast computation of Fourier integral operators. SIAM Journal on Scientific Computing.

  • Candès, E., et al. (2009). A fast butterfly algorithm for the computation of Fourier integral operators. Multiscale Modeling & Simulation: A SIAM Interdisciplinary Journal.

  • Chen, Y., et al. (2019). Optimal control via neural networks: A convex approach.

  • Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems.

  • Geist, M., et al. (2021). Numerical solution of the parametric diffusion equation by deep neural networks. Journal of Scientific Computing.

  • González-Álvarez, M., Dupeyroux, J., Corradi, F., & de Croon, G. C. (2022). Evolved neuromorphic radar-based altitude...

  • Goodfellow, I., et al. (2014). Generative adversarial nets.