Deep Learning via Dynamical Systems: An Approximation Perspective

We build on the dynamical systems approach to deep learning, where deep residual networks are idealized as continuous-time dynamical systems. Although theoretical foundations have been developed on the optimization side through mean-field optimal control theory, the function approximation properties of such models remain largely unexplored, especially when the dynamical systems are controlled by functions of low complexity. In this paper, we establish some basic results on the approximation capabilities of deep learning models in the form of dynamical systems. In particular, we derive general sufficient conditions for universal approximation of functions in $L^p$ using flow maps of dynamical systems, and we also deduce some results on their approximation rates for specific cases. Overall, these results reveal that function approximation by composition through flow maps presents a new paradigm in approximation theory and contributes to building a useful mathematical framework for investigating deep learning.


Introduction and Problem Formulation
Despite the empirical success of deep learning, one outstanding challenge is to develop a useful theoretical framework to understand its effectiveness by capturing the effect of sequential function composition in deep neural networks. In some sense, this is a distinguishing feature of deep learning that separates it from traditional machine learning methodologies.
One candidate for such a framework is the dynamical systems approach [E17, HR17, LCTE18], which regards deep neural networks as a discretization of an ordinary differential equation. Consequently, the latter can be regarded as the object of analysis in place of the former. An advantage of this idealization is that a host of mathematical tools from dynamical systems, optimal control and differential equations can then be brought to bear on various issues faced in deep learning, and more importantly, shed light on the role of composition in function approximation and learning.
Since its introduction, the dynamical systems approach has led to much progress in terms of novel algorithms [PM19, ZZL+19], architectures [HR17, CMH+18, RH18, LZLD18, WYSO18, TSDL18, STD18] and emerging applications [ZW18, ZLLD18, CRBD18, LLH+19]. In contrast, the present work focuses on the theoretical underpinnings of this approach. From the optimization perspective, it has been established that learning in this framework can be recast as a mean-field optimal control problem [LCTE18, EHL19], and local and global characterizations can be derived based on generalizations of the classical Pontryagin's maximum principle and the Hamilton-Jacobi-Bellman equation. Other theoretical developments include continuum limits and connections to optimal transport [SM17, TvG18a, SM19]. Nevertheless, other fundamental questions in this approach remain largely unexplored, especially when it comes to the function approximation properties of these continuous-time idealizations of deep neural networks. In this paper, we establish some basic results in this direction.
The Dynamical Systems Viewpoint. To set the context, we first introduce the dynamical systems viewpoint of deep learning [E17, LCTE18, RH18, EHL19]. For simplicity, we will discuss fully-connected residual neural networks for this exposition, but our main results will largely not depend on such explicit architectures.
Essentially, supervised learning seeks to approximate some function $F : X \to Y$, which we call the oracle, from samples of data from this oracle function. The set $X$ is the input space (e.g. images, time series) and $Y$ is the space of outputs (e.g. labels of images, next value of the time series). As is typical in machine learning, to do this we define a hypothesis space $\mathcal H$, and learning amounts to finding a particular $\hat F \in \mathcal H$ which closely approximates $F$ in some sense. For example, $\hat F$ can be found as a solution of the optimization problem $\hat F = \arg\min_{F' \in \mathcal H} \|F - F'\|$, with $\|\cdot\|$ some appropriately chosen norm. Alternatively, we can solve the empirical risk minimization problem $\hat F = \arg\min_{F' \in \mathcal H} \sum_{i=1}^N \|F(x_i) - F'(x_i)\|$, where the $x_i$ are samples from the input space $X$ with corresponding labels $y_i = F(x_i)$. Just like classical linear basis models, support vector machines and so on, deep learning is yet another choice of hypothesis space, which we now describe.
Let us take $X = \mathbb{R}^n$ as the input space and $Y = \mathbb{R}^m$ as the space of outputs. A deep, residual, fully-connected neural network applies a sequence of transformations to the inputs via the recursion
$$z_{s+1} = z_s + V_s \sigma(W_s z_s + b_s), \qquad s = 0, 1, \dots, S-1, \qquad z_0 = x. \qquad (2)$$
Here, $x \in \mathbb{R}^n$ is the input, $z_s$ is the hidden state at the $s$-th layer, and
$$W_s \in \mathbb{R}^{q \times n}, \qquad V_s \in \mathbb{R}^{n \times q}, \qquad b_s \in \mathbb{R}^q. \qquad (3)$$
The variables $\{W_s, V_s\}$ and $b_s$ are the weights and biases of the $s$-th transformation layer respectively, with $q$ the hidden node size (typically, $q \ge n$). Together, they constitute the trainable parameters in the layer. The activation function $\sigma : \mathbb{R} \to \mathbb{R}$ is applied element-wise. The most common choice for the activation function is the Rectified Linear Unit (ReLU), $\sigma(z) \equiv \mathrm{ReLU}(z) = \max(0, z)$. Other choices include the sigmoid $\sigma(z) = 1/(1 + e^{-z})$, the tanh $\sigma(z) = \tanh(z)$, and so on. The number $S$ is known as the depth of the neural network and it can be quite large for modern architectures, e.g. on the order of hundreds. The output of the entire network is $z_S$, which can then be compared, after a further transformation $g : \mathbb{R}^n \to \mathbb{R}^m$, to the label $y \in \mathbb{R}^m$ corresponding to $x$. The transformation function $g$ is typically kept simple, say an affine function.
We can write (2) compactly by defining $\theta_s = (W_s, V_s, b_s)$, $\Theta = \mathbb{R}^{q \times n} \times \mathbb{R}^{n \times q} \times \mathbb{R}^q$ and $f_\theta(z) = V\sigma(Wz + b)$ to get
$$z_{s+1} = z_s + f_{\theta_s}(z_s), \qquad \theta_s \in \Theta, \qquad z_0 = x, \qquad (4)$$
which gives the deep residual network hypothesis space
$$\mathcal H_{\mathrm{resnet}} = \{x \mapsto g(z_S) : \{z_s\} \text{ satisfies } (4),\ \theta_s \in \Theta,\ s = 0, \dots, S-1\}. \qquad (5)$$
In contrast with traditional approaches, the distinguishing feature of the deep residual network hypothesis space is the presence of iterated transformations governed by (4). This property makes direct analysis difficult due to the lack of mathematical tools to handle function compositions.
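To make the recursion concrete, the following is a minimal NumPy sketch of the residual forward pass (4). The layer sizes, the $1/S$ scaling of $V_s$, and the random parameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_forward(x, params):
    # z_{s+1} = z_s + V_s @ relu(W_s @ z_s + b_s), following recursion (4)
    z = x
    for W, V, b in params:
        z = z + V @ relu(W @ z + b)
    return z

n, q, S = 4, 8, 10                  # input dim, hidden nodes, depth (illustrative)
rng = np.random.default_rng(0)
params = [(rng.normal(size=(q, n)), rng.normal(size=(n, q)) / S, rng.normal(size=q))
          for _ in range(S)]
x = rng.normal(size=n)
print(residual_forward(x, params))  # z_S, to be fed through the output map g
```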
The dynamical systems viewpoint of deep learning is introduced in part to resolve this issue. Instead of (4), we consider the following continuous-time idealization:
$$\dot z(t) = f_{\theta(t)}(z(t)), \qquad z(0) = x, \qquad t \in [0, T]. \qquad (6)$$
That is, we replace the discrete layer numbers $s$ by a continuous variable $t$, which results in a new continuous-time dynamics described by an ordinary differential equation (ODE). Note that for this approximation to be precise, one would need a slight modification of the right hand side of (4) into $z_s + \delta \cdot f_{\theta_s}(z_s)$ for some small $\delta > 0$. The limit $\delta \to 0$ with $T = S\delta$ held constant gives (6) with the identification $t \approx \delta s$. Empirical work shows that this modification is justified, since for trained deep residual networks $z_{s+1} - z_s$ tends to be small [VWB16, JAB+17]. Consequently, the trainable variable $\theta$ is now indexed by the continuous variable $t$. We will assume that each $f_\theta$ is a Lipschitz continuous function on $\mathbb{R}^n$, so that (6) admits unique solutions (see Proposition 3.1). For a terminal time $T > 0$, $z(T)$ can be seen as a function of its initial condition $x$. Thus, we obtain a function $\varphi_T : \mathbb{R}^n \to \mathbb{R}^n$, mapping each $x$ to the solution $z(T)$ at time horizon $T$; $\varphi_T$ is known as the flow map (or Poincaré map) of the dynamical system (6). As a result, we can replace the hypothesis space (5) by
$$\mathcal H_{\mathrm{ode}}(T) = \{x \mapsto g(\varphi_T(x)) : \varphi_T \text{ is the flow map of } (6)\},$$
with the terminal time $T$ playing the role of depth. In words, this hypothesis space contains functions which are $g$ composed with flow maps of a dynamical system in the form of an ODE. It is also convenient to consider the hypothesis space of arbitrarily deep continuous-time networks as the union
$$\mathcal H_{\mathrm{ode}} = \bigcup_{T > 0} \mathcal H_{\mathrm{ode}}(T).$$
The key advantage of this viewpoint is that a variety of tools from continuous-time analysis can be used to analyze various issues in deep learning. This was pursued, for example, in [LCTE18, LH18] for learning algorithms and in [RH18, CMH+17, CMH+18] for network stability. In this paper, we are concerned with the problem of approximation, which is one of the most basic mathematical questions we can ask given a hypothesis space. Let us outline the problem below.
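The correspondence between the residual recursion and the ODE can be seen directly in code: one forward-Euler step of (6) with step size $\delta$ is exactly one $\delta$-scaled residual layer. Below is a hedged sketch; the field $f$ and the parameter shapes are the illustrative fully-connected ones from above.

```python
import numpy as np

def euler_flow(x, f, thetas, T):
    """Approximate the flow map phi_T of dz/dt = f(z, theta(t)) by forward Euler.

    Each Euler step z <- z + delta * f(z, theta) is one residual layer with a
    delta-scaled update, illustrating the ResNet <-> ODE correspondence."""
    delta = T / len(thetas)          # layer s corresponds to time t = delta * s
    z = np.array(x, dtype=float)
    for theta in thetas:
        z = z + delta * f(z, theta)
    return z

# Example with the fully-connected residual field f_theta(z) = V @ relu(W @ z + b)
relu = lambda u: np.maximum(0.0, u)
f = lambda z, th: th[1] @ relu(th[0] @ z + th[2])
rng = np.random.default_rng(1)
thetas = [(rng.normal(size=(8, 4)), rng.normal(size=(4, 8)), rng.normal(size=8))
          for _ in range(100)]
print(euler_flow(rng.normal(size=4), f, thetas, T=1.0))  # approximates phi_T(x)
```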
The Problem of Approximation. The problem of approximation essentially asks how big $\mathcal H_{\mathrm{ode}}$ is. In other words, what kind of functions can we approximate using functions in $\mathcal H_{\mathrm{ode}}$? Before we present our results, let us first distinguish the concept of approximation and that of representation.
• We say that a function $F$ can be represented by $\mathcal H_{\mathrm{ode}}$ if $F \in \mathcal H_{\mathrm{ode}}$.
• In contrast, we say that $F$ can be approximated by $\mathcal H_{\mathrm{ode}}$ if for any $\varepsilon > 0$, there exists an $\hat F \in \mathcal H_{\mathrm{ode}}$ such that $\|F - \hat F\| \le \varepsilon$. Here, $\|\cdot\|$ is some appropriately chosen norm.
Therefore, representation and approximation are mathematically distinct notions. The fact that some class of mappings cannot be represented by $\mathcal H_{\mathrm{ode}}$ does not prevent it from being approximated by $\mathcal H_{\mathrm{ode}}$ to arbitrary accuracy. For example, it is well known that flow maps must be orientation-preserving (OP) homeomorphisms, which form a very small set of functions in the Baire category sense [Pal74], but it is also known that OP homeomorphisms are dense in $L^p$ in dimensions larger than one [BG03].
In this paper we will work mostly in continuous time.

Main Results
In this section, we summarize our main results on the approximation properties of $\mathcal H_{\mathrm{ode}}$ and discuss their significance with respect to related results in the literature. Throughout this paper, we will adopt the following notation:

1. Let $K$ be a measurable subset of $\mathbb{R}^n$. We denote by $C(K)$ the space of real-valued continuous functions on $K$, with norm $\|f\|_{C(K)} = \sup_{x \in K} |f(x)|$. Vector-valued functions are denoted similarly.

2. A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is called Lipschitz if there exists a constant $L > 0$ such that $|f(x) - f(y)| \le L|x - y|$ for all $x, y \in \mathbb{R}^n$. The smallest constant $L$ for which this is true is denoted $\mathrm{Lip}(f)$.

3. We denote by $B(x, r) := \{y \in \mathbb{R}^n : |y - x| \le r\}$ the closed ball of radius $r$ centered at $x$. If $X \subset \mathbb{R}^n$ is a point set, then we define $B(X, r) := \cup_{x \in X} B(x, r)$.

4. Given a uniformly continuous function $f$, we denote by $\omega_f$ its modulus of continuity, i.e. $\omega_f(r) := \sup_{|x - y| \le r} |f(x) - f(y)|$.

Let us begin with some definitions. Denote by $\mathcal F$ the set of functions that constitute the right hand side of Equation (6):
$$\mathcal F = \{f_\theta : \theta \in \Theta\}. \qquad (9)$$
This allows us to write $\mathcal H_{\mathrm{ode}}$ compactly, without explicit reference to the parameterization, as
$$\mathcal H_{\mathrm{ode}} = \{g \circ \varphi_T : T > 0,\ \varphi_T \text{ is the flow map of } \dot z(t) = f_t(z(t)) \text{ with } f_t \in \mathcal F \text{ for each } t\}. \qquad (10)$$
We will hereafter call $\mathcal F$ a control family, since it controls the dynamics induced by the differential equation (6). Unless specified otherwise, we assume $\mathcal F$ contains only Lipschitz functions, which ensures existence and uniqueness of solutions to the corresponding ODEs (see Proposition 3.1).
Next, we introduce the concept of approximation closure, which is used throughout this paper. Definition 2.1 (Approximation Closure). Let $\mathcal F$ be a collection of continuous functions from $\mathbb{R}^n$ to $\mathbb{R}^{n'}$. We denote by $\mathrm{ac}(\mathcal F)$ the approximation closure of $\mathcal F$, meaning that $u \in \mathrm{ac}(\mathcal F)$ if for any compact set $K \subset \mathbb{R}^n$ and $\varepsilon > 0$, there exists $v \in \mathcal F$, depending on $\varepsilon$ and $K$, such that $\|u - v\|_{C(K)} \le \varepsilon$.
We also define the shorthand $\overline{\mathrm{CH}}(\mathcal F) := \mathrm{ac}(\mathrm{CH}(\mathcal F))$ for the approximation closure of the convex hull, where $\mathrm{CH}(\cdot)$ denotes the usual convex hull.
In constructing approximation dynamics, a fundamental role is played by a type of functions called well functions, which we now define. Definition 2.2 (Well Function). We say a Lipschitz function $h : \mathbb{R}^n \to \mathbb{R}$ is a well function if there exists a bounded open convex set $\Omega \subset \mathbb{R}^n$ such that
$$\{x \in \mathbb{R}^n : h(x) = 0\} = \overline{\Omega}.$$
Here $\overline{\Omega}$ is the closure of $\Omega$ in the usual topology on $\mathbb{R}^n$.
Moreover, we say that a vector-valued function $h : \mathbb{R}^n \to \mathbb{R}^{n'}$ is a well function if each of its components $h_i : \mathbb{R}^n \to \mathbb{R}$ is a well function in the sense above.
The name "well function" highlights the rough shape of this type of functions: the zero set of a well function is like the bottom of a well.Of course, the "walls" of this well need not always point upwards and we only require that they are never zero outside of Ω.
We also define the notion of restricted affine invariance, which is weaker than the usual form of affine invariance.

Definition 2.3 (Restricted Affine Invariance). Let $\mathcal F$ be a set of functions from $\mathbb{R}^n$ to $\mathbb{R}^n$. We say that $\mathcal F$ is restricted affine invariant if $f \in \mathcal F$ implies that $x \mapsto Df(Ax + b)$ also belongs to $\mathcal F$ for any diagonal matrices $D, A \in \mathbb{R}^{n \times n}$ and any vector $b \in \mathbb{R}^n$.

We can now state our first main result.

Theorem 2.4 (Sufficient Condition for Universal Approximation). Let $n \ge 2$ and $p \in [1, \infty)$. Let $F : \mathbb{R}^n \to \mathbb{R}^m$ be continuous, $K \subset \mathbb{R}^n$ be compact, and $g : \mathbb{R}^n \to \mathbb{R}^m$ be Lipschitz with $F(K) \subset g(\mathbb{R}^n)$. Assume $\mathcal F$ satisfies the following conditions:

1. $\mathcal F$ is restricted affine invariant (Definition 2.3).

2. $\overline{\mathrm{CH}}(\mathcal F)$ contains a well function (Definition 2.2).

Then, for any $\varepsilon > 0$ there exists an $\hat F \in \mathcal H_{\mathrm{ode}}$ such that $\|F - \hat F\|_{L^p(K)} \le \varepsilon$.

Theorem 2.4 establishes a sufficient condition on $\mathcal F$ for which the induced flow maps form a universal approximating class. The covering assumption $F(K) \subset g(\mathbb{R}^n)$ is in some sense necessary, for if the range of $g$ does not cover $F(K)$, say it misses an open subset $U \subset F(K)$, then no flow map composed with it can approximate $F$. Fortunately, this condition is very easy to satisfy. For example, any non-degenerate linear function $g$ is Lipschitz and onto.
The requirement $n \ge 2$ is also necessary. In one dimension, the result is actually false, due to the topological constraints on flow maps of dynamical systems. More precisely, for $n = 1$ one can show that each $u \in \mathcal H_{\mathrm{ode}}$ must be continuous and increasing, and furthermore that its approximation closure also contains only increasing functions. Hence, there is no hope of approximating any function that is strictly decreasing on an open interval. However, we can prove the next best thing in one dimension: any continuous and increasing function can be approximated by a dynamical system driven by the control family $\mathcal F$. Theorem 2.5 (Sufficient Condition for Universal Approximation in 1D). Let $n = 1$. Then, Theorem 2.4 holds under the additional assumption that $F$ is increasing. Remark 2.6. Theorem 2.5 still holds if one replaces the $L^p(K)$ norm by $C(K)$, and furthermore one can relax the restricted affine invariance property to invariance with respect to only $D = \pm 1$ and $A = 1$ in Definition 2.3, i.e. we only require symmetry and translation invariance.
In the proofs of these results, we rely on using the flow of the dynamical system (6) to "rearrange" the domain of the function $g$ so that it resembles $F$. Here, the concept of a well function plays a central role. It serves to induce some universally controllable dynamics: the portion where the well function equals 0 leaves points invariant, whereas the portion where it is nonzero can drive, via the restricted affine invariance assumption, points to the desired locations. The combination of these effects is enough to rearrange the domain in an essentially arbitrary manner to achieve universal approximation. This gives a sketch of the proof of the main results in this paper.
Most existing theoretical work on the continuous-time dynamical systems approach to deep learning focuses on optimization aspects in the form of mean-field optimal control [EHL19, LM19], or on the connections between the continuous-time idealization and discrete time [TvG18b, SM17, SM19]. The present paper focuses on the approximation aspects of continuous-time deep learning, which are less studied. One exception is the recent work of [ZGUA19], who derived some results in the direction of approximation. However, an important assumption there was that the driving forces on the right hand side of the ODEs (i.e. the control family $\mathcal F$) are themselves universal approximators. Consequently, the results do not elucidate the power of composition and flows, since each "layer" is already complex enough to approximate any arbitrary function, and there is no need for the flow to perform any additional approximation.
In contrast, the approximation results here do not require $\mathcal F$, or even $\overline{\mathrm{CH}}(\mathcal F)$, to be universal approximators. In fact, $\mathcal F$ can be a very small set of functions, and the approximation power of these dynamical systems is by construction attributed to the dynamics of the flow. For example, the assumption that $\overline{\mathrm{CH}}(\mathcal F)$ contains a well function does not imply that the $\mathcal F$ driving the dynamical system is complex, since the former can be much larger than the latter. In the 1D ReLU control family, one can easily construct a well function with respect to the interval $\Omega = (q_1, q_2)$ by averaging two ReLU functions, $\tfrac12[\mathrm{ReLU}(q_1 - x) + \mathrm{ReLU}(x - q_2)]$, but the control family $\mathcal F = \{v\,\mathrm{ReLU}(w\,\cdot + b)\}$ is not complex enough to approximate arbitrary functions without further linear combinations. We will demonstrate in Section 4.1 that many other architectures induce control families that satisfy the conditions in Theorem 2.4 and Theorem 2.5, but the general statements derived above reveal some fundamental mechanics that may be at work in such deep models.
We also note that unlike the results in [ZGUA19], the results here for $n \ge 2$ do not require embedding the dynamical system in higher dimensions to achieve universal approximation. The negative results given in [ZGUA19] (and also [DDT19]), which motivated embedding in higher dimensions, are basically limitations of representation: flow maps of ODEs are OP homeomorphisms and thus can only represent such mappings. However, these are not counter-examples for approximation, since an OP homeomorphism can approximate a mapping that is not OP to arbitrary accuracy in dimensions greater than or equal to two [BG03].
In relation to classical approximation theory, one can observe from the subsequent proofs and constructions that the function approximation process here is dynamical in nature, in that it relies on a sequence of transformations of the domain of the function. This makes it very different from the truncations of a basis expansion typically encountered in traditional approximation theory [DeV98]. For instance, suppose we take $m = 1$ and $g(x) = \sum_j \alpha_j x_j$ to be a linear function in Theorem 2.4. Then, we may interpret $\mathcal H_{\mathrm{ode}}$ as a linear combination of dictionary functions selected from the dictionary built from flow maps,
$$\mathcal D_{\mathrm{ode}} = \{x \mapsto [\varphi(x)]_j : \varphi \text{ a flow map of } (6),\ j = 1, \dots, n\}.$$
In this sense, Theorem 2.4 is a statement about a type of nonlinear $N$-term approximation. In classical nonlinear approximation [DeV98], one usually has $\inf_{\phi_1, \dots, \phi_N \in \mathcal D,\ \alpha_1, \dots, \alpha_N} \|F - \sum_j \alpha_j \phi_j\|$ decaying to 0 as $N$ increases, but non-zero for any finite $N$. However, in the case of the flow map dictionary, Theorem 2.4 shows that as long as $N \ge n$, the infimum is actually 0. Of course, this relies on the fact that we are considering arbitrarily large times in the evolution, so a natural question is how the approximation rate depends on $T$. In Section 4.2, we derive some results in this direction in the 1D case, which further highlight the distinguishing mechanics of this approximation process.
Although the present paper focuses on the continuous-time idealization, we should also discuss the results here in relation to the relevant work on the approximation theory of discrete deep neural networks. In this case, one line of work establishing universal approximation shows that deep networks can approximate some other family of functions known to be universal approximators themselves, such as wavelets [Mal16] and shearlets [GKP19]. Another approach is to focus on certain specific architectures, as in [LPW+17, LJ18, Zho18, BLT+19, DDF+19, EMW19], which sometimes allows explicit asymptotic approximation rates to be derived for appropriate target function classes. Furthermore, non-asymptotic approximation rates for deep ReLU networks are obtained in [SYZ19b, SYZ19a]. These are based on explicit constructions using composition, and hence are similar in flavor to the results here if we take an explicit control family and discretize in time.
With respect to these works, the main difference of the results presented here is that we study general properties of function composition via explicit constructions of approximating flows and formulate sufficient conditions for approximation. In particular, none of the approximation results we present depend on reproducing some other function class (polynomials, wavelets, etc.) known to have universal approximation. Instead, we explicitly construct flow maps of dynamical systems to verify the approximation property. We also provide preliminary investigations into what kind of functions can be efficiently learned by a narrow and deep neural network in continuous time. In this sense, this approach is similar in flavor to the recently proposed Barron function framework for wide and deep networks [MW+19], inspired by the original approximation results of Barron [Bar94] for shallow networks.
Lastly, the results here are also of relevance to mathematical control theory and the theory of dynamical systems. In fact, the problem of approximating functions by flow maps is closely related to the problem of controllability in control theory [Sus17]. However, there is one key difference: in the usual controllability problem on Euclidean spaces, the task is to steer one particular input $x_0$ to a desired output value $\varphi(x_0)$. Here, by contrast, we want to steer the entire set of input values in $K$ to $\varphi(K)$ using the same control $\theta(t)$. This can be thought of as an infinite-dimensional, function-space version of controllability, which is a much less explored area; existing controllability results in infinite dimensions mostly focus on the control of partial differential equations [CL91, BD02].
In the theory of dynamical systems, it is well known that functions represented by flow maps are subject to restrictions. For example, [Pal74] gives a negative result that the diffeomorphisms generated by $C^1$ vector fields are few in the Baire category sense. Some works also give explicit criteria for mappings that can be represented by flows, such as [For55] in $\mathbb{R}^2$ and [Utz81] in $\mathbb{R}^n$; more recently, [Zha09] generalizes some results to the Banach space setting. However, these results concern exact representation, not approximation, and hence do not contradict the positive results presented in this paper. Results on approximation properties are fewer. A relevant one is [BG03], which showed that every $L^p$ mapping can be approximated by orientation-preserving diffeomorphisms constructed using polar factorization and measure-preserving flows. The current paper gives an alternative construction of a dynamical system whose flow also has such an approximation property. Moreover, Theorem 2.4 gives some weak sufficient conditions for any controlled dynamical system to have this property. In this sense, the results here further contribute to the understanding of the density of flow maps in $L^p$.
The rest of the paper is organized as follows. In Section 3.1 we introduce some basic results from the theory of ordinary differential equations that we use throughout this paper. Section 3.2 introduces and establishes some preliminary results which lead to the proof of Theorem 2.5, first in 1D, in Section 4.1. This motivates the concept of well functions and their role in constructing rearrangement dynamics. Furthermore, we establish some simple results on the rates of approximation in specific cases. In Section 4.3, we prove Theorem 2.4, which generalizes the approximation result to higher dimensions.

Results on Ordinary Differential Equations
Throughout this paper, we use some elementary properties and techniques from the classical analysis of ODEs. For completeness, we compile these results in this section. The proofs of well-known results are omitted, and unfamiliar readers are referred to [Arn73] for a comprehensive introduction.
Consider an ODE of the following form:
$$\dot y(t) = f(y(t)), \qquad y(0) = y_0, \qquad (16)$$
where $y_0, y(t) \in \mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}^n$ is a Lipschitz function. An equivalent form of the ODE is the integral form
$$y(t) = y_0 + \int_0^t f(y(s))\, ds.$$
Proposition 3.1 (Existence and Uniqueness). The solution to (16) exists and is unique. Moreover, for each $t$, $y(t)$ is a continuous function of $y_0$.
In the rest of this subsection, we only state and prove results in the one-dimensional case, which is what we need in this paper. Some of these results can be generalized to higher dimensions, and the reader may refer to [Arn73].
First, the following result demonstrates an important limitation of flow maps when it comes to representation: in one dimension, an ODE flow map must preserve order. Proposition 3.2. Suppose $y_1$ and $y_2$ satisfy the same equation (16), but with different initial values $x_1 < x_2$. Then $y_1(t) < y_2(t)$ for all $t$.
Proof. Suppose not; then $y_1(t_0) = y_2(t_0)$ for some $t_0$. Consider the time-reversed ODE
$$\dot w(t) = -f(w(t)), \qquad w(0) = y_1(t_0).$$
Then both $y_1(t_0 - \cdot)$ and $y_2(t_0 - \cdot)$ are solutions to the above. By uniqueness we have $x_1 = x_2$, a contradiction. Since both $y_1$ and $y_2$ are continuous in $t$, we have $y_1(t) < y_2(t)$ for all $t$.
More generally, in higher dimensions, any flow map must be an orientation-preserving (OP) homeomorphism. The general definition of OP and the proof of the previous statement can be found in [Arn73]. For the results in this paper, we only need the one-dimensional case proved in Proposition 3.2, where OP means continuous and increasing, with a continuous inverse. In higher dimensions, the OP property means that if we put a local coordinate chart at some point, then under the action of an OP mapping the coordinate chart will not change its local orientation. In particular, if $\varphi$ is continuously differentiable, then $\varphi$ being OP is roughly equivalent to $\det J_\varphi > 0$ at all points, where $J_\varphi$ is the Jacobian of $\varphi$.
Next we introduce the well-known Grönwall's inequality.

Lemma 3.3 (Grönwall's Inequality). Let $u : [0, T] \to \mathbb{R}$ be continuous and satisfy $u(t) \le \alpha + \beta \int_0^t u(s)\, ds$ for constants $\alpha, \beta \ge 0$. Then $u(t) \le \alpha e^{\beta t}$ for all $t \in [0, T]$.
Finally, we prove some practical results, which follow easily from classical results but are used in some of the proofs in the main body.
Proposition 3.4. Let $y(s, \cdot)$ denote the solution of an ODE of the type (16) with initial value $s$, and fix a time horizon $[0, T]$. Define the modulus of continuity of the trajectory by
$$\omega_{s,[0,T]}(r) := \sup\{|y(s, t_1) - y(s, t_2)| : t_1, t_2 \in [0, T],\ |t_1 - t_2| \le r\}.$$
If $s$ ranges over a compact set $K$, then $\omega_{s,[0,T]}(r)$ converges to 0 as $r \to 0$, uniformly in $s \in K$.
Proof. We denote $a = \min K$, $b = \max K$. By Proposition 3.2, we know that $y(a, t) \le y(s, t) \le y(b, t)$ for all $s \in K$, thus the set of trajectory values $H = \{y(s, t) : s \in K,\ t \in [0, T]\}$ is bounded. Since $f$ is continuous, $|\dot y|$ is uniformly bounded on $H$, so the trajectories are uniformly Lipschitz in $t$, implying the result.
The following proposition shows that in one dimension, if we have a well function, we can transport one point to another provided they are located on the same side of the well function's zero interval. Proposition 3.5. Suppose $f(x) < 0$ for $x \ge x_0$, and let $x_0 < x_1 < x_2$. Consider the ODE
$$\dot y(t) = f(y(t)), \qquad y(0) = x_2.$$
Then the solution eventually reaches $x_1$, i.e., $y(T) = x_1$ for some $T > 0$.
Proof. Choose $\tilde x_1 \in (x_0, x_1)$ and define $m = -\sup_{y \in [\tilde x_1, x_2]} f(y) > 0$. As long as $y(s) \in [\tilde x_1, x_2]$ we have $\dot y(s) \le -m$. Set $t = (x_2 - x_1)/m$. If $y(t) \le x_1$ then we are done by continuity (the intermediate value theorem). Otherwise, $y(s) \in [x_1, x_2]$ for all $s \in [0, t]$, so that $y(t) \le x_2 - mt = x_1$, which again implies our result by continuity.
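As a numerical illustration of Proposition 3.5 (a sketch under assumed parameters: we take $f$ to be the negative of a ReLU well function with zero interval $[-1, 1]$, $x_2 = 3.0$ and $x_1 = 1.5$), the flow indeed reaches $x_1$ in finite time:

```python
# Hedged illustration of Proposition 3.5: with f < 0 to the right of the well,
# the flow dy/dt = f(y) transports x2 = 3.0 down to x1 = 1.5 in finite time.
def f(y, q1=-1.0, q2=1.0):
    # negative of a ReLU well function: zero on [q1, q2], negative outside
    return -(max(q1 - y, 0.0) + max(y - q2, 0.0)) / 2.0

y, t, dt, x1 = 3.0, 0.0, 1e-3, 1.5
while y > x1:
    y += dt * f(y)      # forward Euler step of the transport dynamics
    t += dt
print(f"reached y = {y:.4f} at time T = {t:.3f}")  # T is about 2 ln 4 here
```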
With these results on ODEs in mind, we now present the proofs of our main results.

From Approximation of Functions to Approximations of Domain Transformations
Now, we show that under mild conditions, as long as we can approximate any continuous domain transformation $\varphi : \mathbb{R}^n \to \mathbb{R}^n$ using flow maps, we can show that $\mathcal H_{\mathrm{ode}}$ is a universal approximator. Consequently, in establishing our main results we can pass to the problem of approximating an arbitrary $\varphi$ by flow maps. Proposition 3.6. Let $F : \mathbb{R}^n \to \mathbb{R}^m$ be continuous and $g : \mathbb{R}^n \to \mathbb{R}^m$ be Lipschitz. Let $K \subset \mathbb{R}^n$ be compact and suppose $g(\mathbb{R}^n) \supset F(K)$. Then, for any $\varepsilon > 0$ and $p \in [1, \infty)$, there exists a continuous function $\varphi : \mathbb{R}^n \to \mathbb{R}^n$ such that $\|F - g \circ \varphi\|_{L^p(K)} \le \varepsilon$.

Proof. This follows from a general result on function composition proved in [LST19]. We prove it in this special case for completeness.
The set $F(K)$ is compact, so for any $\delta > 0$ we can form a partition $F(K) = \cup_{i=1,\dots,N} B_i$ with $\mathrm{diam}(B_i) \le \delta$. By assumption, $g^{-1}(B_i)$ is non-empty for each $i$, so let us pick $z_i \in g^{-1}(B_i)$. For each $i$ we define $A_i = F^{-1}(B_i) \cap K$, so that $K = \cup_i A_i$, and consider the set $\{z_1, \dots, z_N\}$, which is bounded. By inner regularity of the Lebesgue measure, for any $\delta' > 0$ and for each $i$ we can find a compact $K_i \subset A_i$ such that $|A_i \setminus K_i| \le \delta'$ and the $K_i$'s are disjoint. By Urysohn's lemma, for each $i$ there exists a continuous function $\varphi_i$ such that $|\varphi_i(x)| \le 1$ for all $x$, $\varphi_i = 1$ on $K_i$ and $\varphi_i = 0$ on $\cup_{j \ne i} K_j$. Now, we form the continuous function $\varphi(x) = \sum_{i=1}^N \varphi_i(x) z_i$. We define the set $K' = \{\sum_{i=1}^N \alpha_i z_i : \alpha_i \in [0, 1]\}$, which is clearly compact, and $\varphi(x) \in K'$ for all $x$. Then, we have
$$\|F - g \circ \varphi\|_{L^p(K)}^p = \sum_i \int_{K_i} |F - g \circ \varphi|^p + \int_{K \setminus \cup_i K_i} |F - g \circ \varphi|^p \le \delta^p |K| + N\delta' \big(\|F\|_{C(K)} + \|g\|_{C(K')}\big)^p,$$
where we used the fact that on $K_i$, both $F(x)$ and $g(\varphi(x)) = g(z_i)$ lie in $B_i$. We take $\delta'$ small enough so that the last term is bounded by $\delta^p$. Then, we have $\|F - g \circ \varphi\|_{L^p(K)} \le \delta (1 + |K|)^{1/p}$. Taking $\delta = \varepsilon/(1 + |K|)$ yields the result.
We shall hereafter assume that $g(\mathbb{R}^n) \supset F(K)$, which as discussed earlier is easily satisfied by taking $g$ to be any onto function. Hence we have the following immediate corollary. Corollary 3.7. Assume the conditions in Proposition 3.6. Let $\mathcal A$ be some collection of continuous functions from $\mathbb{R}^n$ to $\mathbb{R}^n$ such that for any $\delta > 0$ and any continuous function $\varphi : \mathbb{R}^n \to \mathbb{R}^n$, there exists $\psi \in \mathcal A$ with $\|\varphi - \psi\|_{C(K)} \le \delta$. Then, for any $\varepsilon > 0$ there exists $\psi \in \mathcal A$ such that $\|F - g \circ \psi\|_{L^p(K)} \le \varepsilon$.

Properties of Attainable Sets and Approximation Closures
Owing to Corollary 3.7, for the rest of the paper we will focus on proving universal approximation of continuous transformations $\varphi$ from $\mathbb{R}^n$ to $\mathbb{R}^n$ by flow maps of the dynamical system (6), after which we can deduce universal approximation properties of $\mathcal H_{\mathrm{ode}}$ via Corollary 3.7.
We now establish some basic properties of flow maps as well as approximation closures. In principle, in our hypothesis space (10) we allow $t \mapsto f_t(z)$ to be any measurable mapping for any $z \in \mathbb{R}^n$. However, it turns out that to establish approximation results, it is enough to consider the smaller family of piecewise-constant-in-time mappings, i.e. $f_t = f_j \in \mathcal F$ for $t \in [t_{j-1}, t_j)$. For a fixed $f \in \mathcal F$, let $P^\tau_f$ denote the flow map of the following dynamics at time horizon $\tau$:
$$\dot z(t) = f(z(t)), \qquad z(0) = x.$$
That is, $P^\tau_f(x) = z(\tau)$. The attainable set at a finite time horizon $T$ due to piecewise-constant-in-time controls, denoted $\mathcal A_{\mathcal F}(T)$, is defined as
$$\mathcal A_{\mathcal F}(T) = \Big\{P^{\tau_k}_{f_k} \circ \cdots \circ P^{\tau_1}_{f_1} : f_j \in \mathcal F,\ \tau_j \ge 0,\ \sum_j \tau_j \le T,\ k \ge 1\Big\}.$$
In other words, $\mathcal A_{\mathcal F}(T)$ contains the flow map of an ODE whose driving force is $f_j$ for $t \in [t_{j-1}, t_j)$, $j = 1, \dots, k$. It contains all the domain transformations that can be attained by an ODE by selecting a piecewise-constant-in-time driving force from $\mathcal F$ up to a terminal time $T$. The union of flow maps over all possible terminal times, $\mathcal A_{\mathcal F} = \cup_{T > 0} \mathcal A_{\mathcal F}(T)$, is the overall attainable set. In view of Corollary 3.7, to establish the approximation property of $\mathcal H_{\mathrm{ode}}$ it is sufficient to prove that any continuous transformation $\varphi$ can be approximated by mappings in $\mathcal A_{\mathcal F}$. Now, let us state some basic properties of the approximation closure defined in Definition 2.1, which are useful in later sections. The proofs are immediate and hence omitted. Lemma 3.8. If $\mathcal A$ is a family of continuous and increasing functions from $\mathbb{R}$ to $\mathbb{R}$, then $\mathrm{ac}(\mathcal A)$ contains only increasing functions. Lemma 3.9. We have $\mathcal A \subset \mathrm{ac}(\mathcal A) = \mathrm{ac}(\mathrm{ac}(\mathcal A))$. Moreover, if $\mathcal A \subset \mathcal B$, then $\mathrm{ac}(\mathcal A) \subset \mathrm{ac}(\mathcal B)$.
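The following sketch illustrates an element of $\mathcal A_{\mathcal F}(T)$: a numerical composition of flow maps $P^\tau_f$ driven by piecewise-constant-in-time fields. The two ReLU-type fields and the switching time are illustrative choices, not taken from the paper.

```python
import numpy as np

def flow(f, tau, steps=1000):
    """Numerical flow map P^tau_f of dz/dt = f(z), via forward Euler (a sketch)."""
    def phi(x):
        z, dt = np.array(x, dtype=float), tau / steps
        for _ in range(steps):
            z = z + dt * f(z)
        return z
    return phi

# An element of A_F(T) with T = 1: switch from f1 to f2 at time t = 1/2.
f1 = lambda z: np.maximum(0.0, 1.0 - z)     # illustrative ReLU-type fields
f2 = lambda z: -np.maximum(0.0, z - 2.0)
P1, P2 = flow(f1, 0.5), flow(f2, 0.5)
psi = lambda x: P2(P1(x))                    # P^{1/2}_{f2} o P^{1/2}_{f1}
print(psi(np.array([0.0, 1.0, 3.0])))        # a domain transformation in A_F(1)
```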
Next, we state and prove an important property of approximation closures of control families: $\mathcal F$ shares the same approximation ability as $\mathrm{CH}(\mathcal F)$ when used to drive dynamical systems. However, the convex hull of a Lipschitz function family might not itself be a Lipschitz function family in general, hence we adopt a slightly different formulation. Proposition 3.10. Let $\mathcal F$ be a Lipschitz control family. Then, for any Lipschitz control family $\mathcal G$ such that $\mathcal F \subset \mathcal G \subset \mathrm{CH}(\mathcal F)$, we have $\mathrm{ac}(\mathcal A_{\mathcal F}) = \mathrm{ac}(\mathcal A_{\mathcal G})$. Proposition 3.10 is an important result concerning the effect of continuous evolution, which can be regarded as a continuous family of compositions: any function family driving a dynamical system is as good as its convex hull in driving the system, and the latter can be an immensely larger family of functions. Similar properties of flows have been observed in the context of variational problems; see [War62]. This is a first hint at the power of composition for function approximation.
To prove Proposition 3.10 we need the following lemmas. Lemma 3.11. Let $\mathcal A_{\mathcal F}$ and $\mathcal A_{\mathcal G}$ be the attainable sets of $\mathcal F$ and $\mathcal G$, where $\mathcal F \subset \mathcal G \subset \mathrm{ac}(\mathcal F)$. Then we have $\mathrm{ac}(\mathcal A_{\mathcal F}) = \mathrm{ac}(\mathcal A_{\mathcal G})$.

Proof. It suffices to show that $\mathcal A_{\mathcal G} \subset \mathrm{ac}(\mathcal A_{\mathcal F})$, since then $\mathrm{ac}(\mathcal A_{\mathcal F}) \subset \mathrm{ac}(\mathcal A_{\mathcal G}) \subset \mathrm{ac}(\mathcal A_{\mathcal F})$, which implies the lemma. Suppose $\varphi = P^{t_k}_{\tilde f_k} \circ \cdots \circ P^{t_1}_{\tilde f_1}$, where each $\tilde f_i$ is in $\mathcal G$. Fix a compact set $K$ and $\varepsilon > 0$; we construct a function $\hat\varphi \in \mathcal A_{\mathcal F}$ such that $\|\varphi - \hat\varphi\|_{C(K)} \le \varepsilon$. We proceed by induction on $k$. The case $k = 0$ is obvious, since $\varphi$ is then just the identity mapping. Suppose $\varphi = P^{t_k}_{\tilde f_k} \circ \psi$, where $\psi$ is a composition of $k - 1$ flow maps. For some $\varepsilon_1$ (to be determined later) and $K$, the induction hypothesis gives $\hat\psi \in \mathcal A_{\mathcal F}$ with $\|\psi - \hat\psi\|_{C(K)} \le \varepsilon_1$, and we compare $P^{t_k}_{\tilde f_k} \circ \psi$ with $P^{t_k}_{f_k} \circ \hat\psi$ for an $f_k \in \mathcal F$ to be chosen. By subtracting the corresponding integral forms and applying Grönwall's inequality, we obtain
$$\|P^{t_k}_{\tilde f_k} \circ \psi - P^{t_k}_{f_k} \circ \hat\psi\|_{C(K)} \le (\varepsilon_1 + t_k \varepsilon_2)\, e^{t_k \mathrm{Lip}(\tilde f_k)},$$
provided we choose $f_k$ such that $|f_k - \tilde f_k| \le \varepsilon_2$ on $B(\psi(K), (\varepsilon_1 + t_k \varepsilon_2) e^{t_k \mathrm{Lip}(\tilde f_k)})$. Choosing $\varepsilon_1$ and $\varepsilon_2$ small enough completes the induction.
Lemma 3.12. Let $f, g$ be Lipschitz functions. Then for any $t > 0$, $P^t_{(f+g)/2} \in \mathrm{ac}(\mathcal A_{\{f, g\}})$.

Proof. We will show that $(P^{t/2n}_f \circ P^{t/2n}_g)^n \to P^t_{(f+g)/2}$ uniformly on compact sets as $n \to \infty$. Let $z(\cdot)$ denote the solution driven by $(f + g)/2$ and let $w(\cdot)$ denote the trajectory that alternates between $g$ and $f$ on intervals of length $t/2n$. Comparing the integral forms of the two evolutions, the discrepancy accumulated on each interval is controlled by $\omega_{z,[0,t]}(\tfrac{t}{2n})$, where $\omega$ is the modulus of continuity defined in Proposition 3.4. Again, by Grönwall's inequality we have, for any selected compact set $K$,
$$\|w - z\|_{C(K)} \le C\, e^{t\, \mathrm{Lip}((f+g)/2)}\ \omega_{z,[0,t]}\big(\tfrac{t}{2n}\big).$$
Since $\omega_{z,[0,t]}(\tfrac{t}{2n}) \to 0$ by Proposition 3.4, we obtain the result.

Lemma 3.13. Let $\mathcal F' = \{\sum_{i=1}^m q_i f_i : q_i \in \mathbb{Q}_{\ge 0},\ \sum_i q_i = 1,\ f_i \in \mathcal F,\ m \in \mathbb{N}\}$. Then $\mathrm{CH}(\mathcal F) \subset \mathrm{ac}(\mathcal F')$.

Proof. Immediate. Now, we are ready to prove Proposition 3.10.
Proof of Proposition 3.10. Using the same technique as in the proof of Lemma 3.12, we can show that for $f_1, \cdots, f_m \in \mathcal F$, $P^t_h \in \mathrm{ac}(\mathcal A_{\mathcal F})$, where $h = \sum_i q_i f_i$ for nonnegative rational numbers $q_i$ summing to 1. Let $\mathcal A'$ be the attainable set of the control family $\mathcal F' = \{\sum_{i=1}^m q_i f_i : q_i \in \mathbb{Q}_{\ge 0},\ \sum_i q_i = 1,\ f_i \in \mathcal F,\ m \in \mathbb{N}\}$; then we have $\mathrm{ac}(\mathcal A') = \mathrm{ac}(\mathcal A_{\mathcal F})$. Since $\mathcal F \subset \mathcal G \subset \mathrm{CH}(\mathcal F) \subset \mathrm{ac}(\mathcal F')$ by Lemma 3.13, applying Lemma 3.11 we arrive at the desired result.
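The splitting argument behind Lemma 3.12 and Proposition 3.10 can be checked numerically: alternating short flows of $f$ and $g$ approximate the flow of their average. A minimal sketch, with the illustrative fields $f = \tanh$ and $g(z) = 1 - z$:

```python
import numpy as np

def euler(f, x, t, steps=50):
    """Forward-Euler approximation of the scalar flow P^t_f(x)."""
    z, dt = float(x), t / steps
    for _ in range(steps):
        z += dt * f(z)
    return z

f = lambda z: np.tanh(z)          # illustrative Lipschitz fields
g = lambda z: 1.0 - z
t, n, x = 1.0, 64, 0.3

z_avg = euler(lambda z: 0.5 * (f(z) + g(z)), x, t, steps=2000)
z_split = x
for _ in range(n):                # (P^{t/2n}_f o P^{t/2n}_g)^n applied to x
    z_split = euler(g, z_split, t / (2 * n))
    z_split = euler(f, z_split, t / (2 * n))
print(abs(z_avg - z_split))       # discrepancy shrinks as n grows
```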

Proof of Main Results
In this section, we prove the main results (Theorems 2.4 and 2.5). We start with the one-dimensional case to gain some insight into how a result can be established in general and, in particular, to elucidate the role of well functions (Definition 2.2) in constructing rearrangement dynamics. This serves to motivate the extension of the results to higher dimensions.
4.1 Approximation Results in One Dimension and the Proof of Theorem 2.5

Proposition 3.2, together with the fact that compositions of continuous and increasing functions are again continuous and increasing, implies that any function from $\mathcal A_{\mathcal F}$ must be continuous and increasing. We will adopt the short form "CI" for such functions. In 1D, this poses a restriction on the approximation power of $\mathrm{ac}(\mathcal A_{\mathcal F})$, as the following result shows: Proposition 4.1. Let $n = 1$ and $\mathcal F$ be a Lipschitz control family with attainable set $\mathcal A_{\mathcal F}$. Then $\mathrm{ac}(\mathcal A_{\mathcal F})$ contains only increasing functions.
Proof. Proposition 3.2 implies that any function in $\mathcal A_{\mathcal F}$ is continuous and increasing, since both properties are closed under composition. The proposition then follows from Lemma 3.8.

It follows from Proposition 4.1 that any continuous function $\varphi$ that is strictly decreasing over an interval $[c, d]$ cannot be approximated by functions in $\mathcal A_{\mathcal F}$. Nevertheless, it makes sense to ask for the next best property: can $\mathrm{ac}(\mathcal A_{\mathcal F})$ contain every CI function?
To investigate this problem, we first select an appropriate control family, which corresponds to deep neural networks with ReLU activations, and see if it can indeed approximate any CI function. We will later remove this explicit architectural assumption. The ReLU control family is given by
$$\mathcal F_{\mathrm{ReLU}} = \{x \mapsto v\,\mathrm{ReLU}(wx + b) : v, w, b \in \mathbb{R}\}. \qquad (44)$$
This is obtained by choosing $\sigma(z) = \mathrm{ReLU}(z) = \max(0, z)$ in (6) and corresponds to a fully-connected residual network with ReLU activations in one dimension, with only one hidden node. Notice that the ReLU control family (44) satisfies the restricted affine invariance condition defined in Definition 2.3.
We now show that in one dimension, flow maps of ODEs driven by the ReLU control family can in fact approximate any CI function. Proposition 4.2. Let $\varphi : \mathbb{R} \to \mathbb{R}$ be a CI function and $\mathcal F$ be the ReLU control family (44). Then, for any $\varepsilon > 0$ and compact $K \subset \mathbb{R}$, there exists $\hat\varphi \in \mathcal A_{\mathcal F}$ such that $\|\varphi - \hat\varphi\|_{C(K)} \le \varepsilon$. In other words, $\varphi \in \mathrm{ac}(\mathcal A_{\mathcal F})$.
Proof. We rely on the following lemma, from which we can deduce the desired result.

Lemma 4.3. Let $\mathcal F$ be the ReLU control family, and let $x_1 < x_2 < \cdots < x_k$ and $y_1 < y_2 < \cdots < y_k$ be real numbers. Then there exists $\psi \in \mathcal A_{\mathcal F}$ such that $\psi(x_i) = y_i$ for $i = 1, \dots, k$.
We postpone the proof of Lemma 4.3 and first show how to deduce Proposition 4.2 from it. By replacing $K$ with a larger set, we may assume that $K$ is a closed interval. Consider a partition $\Delta$ of $K$ with nodes $x_0 < x_1 < \cdots < x_M$. We can find $\psi \in \mathcal A_{\mathcal F}$ such that $\psi(x_i) = \varphi(x_i)$ for all $i$. Since $\psi$ is increasing, for $x \in [x_{i-1}, x_i]$ we have
$$\psi(x) - \varphi(x) \le \psi(x_i) - \varphi(x) = \varphi(x_i) - \varphi(x) \le \omega_\varphi(|\Delta|),$$
where $|\Delta| := \max_{1 \le i \le M} |x_i - x_{i-1}|$. The bound $\psi(x) - \varphi(x) \ge -\omega_\varphi(|\Delta|)$ holds for the same reason. Hence we have $\|\varphi - \psi\|_{C(K)} \le \omega_\varphi(|\Delta|)$. Since $\varphi$ is continuous, sending $|\Delta|$ to 0 gives the desired result.
Now it remains to prove Lemma 4.3 constructively. To do this, first observe that the definition of a well function (Definition 2.2), when specialized to one dimension, is a Lipschitz function that vanishes exactly on an interval $Q = [q_1, q_2]$, for some $q_2 > q_1$, and is nonzero outside it. Such a function can be constructed from the ReLU family by averaging:
$$h_Q(x) = \tfrac12\big[\mathrm{ReLU}(q_1 - x) + \mathrm{ReLU}(x - q_2)\big].$$
Obviously, $h_Q \in \mathrm{CH}(\mathcal F) \subset \overline{\mathrm{CH}}(\mathcal F)$, so the condition that the latter contains a well function is trivially satisfied for the ReLU control family.
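A quick numerical check of this construction (the interval $Q = [-1, 1]$ is an illustrative choice):

```python
import numpy as np

# The 1D well function built from two ReLUs, as in the text: zero exactly on
# Q = [q1, q2] and positive outside, so it lies in CH(F_ReLU).
def h_Q(x, q1=-1.0, q2=1.0):
    return 0.5 * (np.maximum(q1 - x, 0.0) + np.maximum(x - q2, 0.0))

xs = np.linspace(-3, 3, 13)
print(np.stack([xs, h_Q(xs)]))  # vanishes on [-1, 1], "walls" on both sides
```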
Proof of Lemma 4.3. By Proposition 3.10, we may work with the enlarged control family $\mathcal G = \mathcal F \cup \{h_Q : Q \subset K\}$. We will show that $\mathcal G$ can produce the desired approximation property. We construct mappings $\psi_k$ which map $x_i$ to $y_i$, $i = 1, 2, \cdots, k$, by induction. First we show the base case $k = 1$. Take $h_Q$ to be the well function with respect to $Q = [q_1, q_2]$. Since $\mathcal F$ is translation invariant, we can suppose that both $x_1$ and $y_1$ are greater than $q_2$. Since $h_Q$ does not change sign on $[q_2, \infty)$, by Proposition 3.5 we know that either the flow of $h_Q$ or that of $-h_Q$ carries $x_1$ to $y_1$ at some time $t$. This proves the base case (since $\mathcal F$ is symmetric, as we assumed).
Suppose we have constructed $\psi_k$; we now construct $\psi_{k+1}$ from it. Applying $\psi_k$ first, we may assume that $x_i = y_i$, $i = 1, 2, \cdots, k$. Again, let $h_Q$ be a well function with zero interval $Q = [q_1, q_2]$ (with $q_2 < \min(x_1, y_1)$) and let $h_{Q'}$ be a well function with zero interval $Q' = [q_0, q_1]$. Flowing along suitable sign choices of $h_Q$ and $h_{Q'}$, let $t_1$ and $t_2$ be the respective times at which these flows would carry $x_{k+1}$ to $y_{k+1}$; clearly we have $t_1 < t_2$. Choosing any $t' \in (t_1, t_2)$ and composing the corresponding flows produces a map $\psi$ with $\psi(x_i) = y_i$ for $i = 1, \dots, k+1$, as desired. By induction, we have completed the proof of Lemma 4.3.
Sufficient Conditions for Approximation of CI Functions and the Proof of Theorem 2.5. We showed previously that all CI functions can be approximated by ReLU-driven dynamical systems.
In this section, we do away with an explicit architecture, which leads to the proof of Theorem 2.5. The key observation from the proof of Lemma 4.3 is that all we really need is a well function contained in $\overline{\mathrm{CH}}(\mathcal F)$. Whether or not $\mathcal F$ itself is the ReLU control family, or any other specific family, is inconsequential. This motivates us to ask the question of sufficiency: what assumptions on $\mathcal F$ are enough to guarantee that it is a universal control family? Notice that instead of constructing an explicit well function as the average of two ReLU functions, we can use an arbitrary well function as defined in Definition 2.2 to drive the dynamics.
The following result makes this precise.

Proposition 4.4. Assume the control family $\mathcal F$ is symmetric and translation invariant, which is equivalent to restricted affine invariance with $D = \pm 1$ and $A = 1$ in Definition 2.3, and that $\overline{\mathrm{CH}}(\mathcal F)$ contains a well function. Then, the conclusion of Proposition 4.2 holds.

Proof. The proof is almost identical to that of Proposition 4.2, with the well function constructed by averaging two ReLU functions replaced by a general well function contained in $\overline{\mathrm{CH}}(\mathcal F)$. Notice that since a well function does not change sign outside its zero interval, by choosing a proper sign one can always move a finite set of points arbitrarily close to the interval; this follows from Proposition 3.5. It also follows that if $\overline{\mathrm{CH}}(\mathcal F)$ contains all continuous functions, it must in particular contain a well function, and so $\mathcal A_{\mathcal F}$ has the desired approximation property. However, this is not necessary for approximation, as Theorem 2.5 shows.

Examples of Other Universal Control Families. As a consequence of Proposition 4.4, we can give some other control families that also drive ODEs with universal approximating flow maps. For example, let us consider the same family (44) but with the sigmoid activation function in place of ReLU. We will show that convex combinations of scaled sigmoids can approximate the soft-threshold function $s(x) = \mathrm{ReLU}(x - 1) - \mathrm{ReLU}(-x - 1)$, whose zero set is exactly $[-1, 1]$. Direct estimation shows that for all $|x| < 2$, the error between $s$ and an average of $M$ sigmoids of steepness $N$ is uniformly small; sending $N = o(M) \to \infty$ we get the result.
Hence, a control family with sigmoid activations can approximate the soft-threshold function, which is a well function. By Proposition 4.4, the dynamical system driven by neural networks with sigmoid activations again satisfies the sufficient condition for universal approximation. A similar result holds for tanh activations, since $\tanh(x) = 2\sigma(2x) - 1$.
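The following sketch illustrates the approximation numerically, assuming the soft-threshold function takes the form $s(x) = \mathrm{ReLU}(x-1) - \mathrm{ReLU}(-x-1)$ stated above; the steepness $N$ and the number of terms $M$ are illustrative, with $N = o(M)$:

```python
import numpy as np

def sigmoid(z):
    # clipped to avoid overflow in exp; behavior unchanged on the clipped range
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50.0, 50.0)))

def soft_threshold(x):
    return np.maximum(x - 1.0, 0.0) - np.maximum(-x - 1.0, 0.0)

def sigmoid_approx(x, N=40, M=4000):
    # average of M family members v*sigmoid(w x + b) with v = M/N, w = +-N:
    # each staircase of steep sigmoids integrates to an approximate ramp
    ks = np.arange(1, M + 1)
    ramp_pos = sigmoid(N * (x[:, None] - 1.0) - ks).sum(axis=1) / N
    ramp_neg = sigmoid(N * (-x[:, None] - 1.0) - ks).sum(axis=1) / N
    return ramp_pos - ramp_neg

xs = np.linspace(-1.9, 1.9, 9)
print(np.max(np.abs(sigmoid_approx(xs) - soft_threshold(xs))))  # small error
```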
To close this section, observe that the linear activation $\sigma(z) = z$ yields a control family that does not contain a well function in $\overline{\mathrm{CH}}(\mathcal F)$. We can also immediately see that it cannot produce universal approximating flow maps, since the resulting flow maps are always linear functions. Remark 4.5. In one dimension, the ability of a dynamical system to approximate any continuous and increasing function has the immediate consequence that if we were to embed the dynamical system in two dimensions, then we can approximate any function $\varphi$ of bounded variation, as long as we are allowed a linear transformation at the end, e.g. if $g$ in Prop. 3.6 is linear. This is because functions of bounded variation can always be written as a difference of two increasing functions.

Approximation Rates in One Dimension
All results so far concern whether a given function can be approximated by a dynamical system with control families satisfying certain conditions. However, by the very definition of the attainable set we are forced to consider dynamical systems of finite, but arbitrarily large, time horizons. Just as in the development of traditional approximation theory, one may then ask the following: given an approximation budget (here the time horizon $T$), how well can we approximate a given function? Perhaps a more pertinent question is this: what kind of functions can be efficiently approximated by dynamical systems?
In this section, we give some results in this direction in the simplest setting: the one-dimensional case ($n = 1$) with the ReLU control family. For convenience of exposition, we assume that our target function $\varphi$ is defined on $[0, 1]$. We postpone results on general control families in higher dimensions to future work.
To properly quantify efficiency, we should eliminate the positive homogeneity of the ReLU control family, which masks the effect of the time horizon $T$ due to the ability to arbitrarily rescale time. Therefore, we restrict $|v|, |w| \le 1$ in $v\,\mathrm{ReLU}(w\,\cdot + b)$, and then the time horizon becomes a meaningful quantity. Remark 4.6. An alternative modification is to use $\int |w|\, dt$ in place of $T$ to measure the approximation rate. This notion is more closely related to the Barron space analysis [MW+19]. It can be checked that one can replace $T$ by $\int |w|\, dt$ in the following results.
First, we show in the following lemma that if $\varphi$ is piecewise linear, then it can be represented by functions in $\mathcal A_{\mathcal F}(T)$ for $T$ large enough.
Before stating the lemma, we first define the total variation with a slight modification. Suppose $u$ is a function defined on $[0, 1]$; we extend $u$ to $u_E$ by
$$u_E(x) = u(0) \ \text{ for } x < 0, \qquad u_E(x) = u(x) \ \text{ for } x \in [0, 1], \qquad u_E(x) = u(1) \ \text{ for } x > 1.$$
We define $\|u\|_{TV} = \|u_E\|_{TV[-\varepsilon, 1+\varepsilon]}$, where the latter is defined as
$$\|u_E\|_{TV[a, b]} = \sup_{a = t_0 < t_1 < \cdots < t_N = b} \sum_{i=0}^{N-1} |u_E(t_{i+1}) - u_E(t_i)|.$$
Lemma 4.7. If $\ln \varphi'$ is a piecewise constant function with $K$ pieces, then $\varphi$ can be written as
$$\varphi = c + P^{t_{K-1}}_{g_{K-1}} \circ \cdots \circ P^{t_1}_{g_1},$$
where each $P^\bullet_\bullet$ denotes a flow map as defined in Section 3.2, each $g_i$ is a ReLU field, and $c$ is a constant. Moreover, we have $\varphi \in \mathrm{ac}(\mathcal A_{\mathcal F}(T))$ for $T \ge \|\ln \varphi'\|_{TV}$.

Proof. The proof is by an inductive construction. Taking derivatives, we see that $\ln \varphi'$ is piecewise constant and can be written as
$$\ln \varphi' = c_0 + \sum_{i=1}^{K-1} w_i H_i,$$
where the $H_i$ are Heaviside functions with jumps at distinct locations. Integrating the last jump we obtain $P^{t_{K-1}}_{g_{K-1}}$, then integrating the next we obtain $P^{t_{K-2}}_{g_{K-2}}$, and so on. One can easily verify that each $P^\bullet_\bullet$ is a flow map generated by some ReLU activation function. Hence we prove the first part of the lemma.
For the second part, we notice that if $f = \mathrm{ReLU}(wx + b)$, then $\ln (P^t_f)'$ is a (scaled) Heaviside function with a jump at $x = -b/w$. Since we can find a decomposition $\sum_i w_i H_i$ such that $\sum_i |w_i| = \|\ln \varphi'\|_{TV}$, and realizing the jump $w_i$ costs time $|w_i|$ under the normalization $|v|, |w| \le 1$, the second part of the lemma is proven.
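As a small worked example of the time-horizon bound in Lemma 4.7, the minimal $T$ for a piecewise linear increasing $\varphi$ can be computed from its slopes (illustrative values):

```python
import numpy as np

# Time-horizon lower bound ||ln phi'||_TV from Lemma 4.7 for a piecewise linear
# increasing phi; the slopes on consecutive pieces are illustrative choices.
slopes = np.array([0.5, 2.0, 1.0, 4.0])      # phi' on each piece
T_min = np.abs(np.diff(np.log(slopes))).sum()  # total variation of ln phi'
print(T_min)                                  # phi is realizable for T >= T_min
```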
From the proof of the previous lemma, we know that the required time horizon is governed by $\|\ln \varphi'\|_{TV}$. In view of this, we prove the following quantitative approximation result. Proposition 4.8. Suppose $\varphi : [0, 1] \to \mathbb{R}$ is an increasing function. Moreover, suppose $\varphi$ is piecewise smooth with $T_0 := \|\ln \varphi'\|_{TV} < \infty$. Then $\varphi \in \mathrm{ac}(\mathcal A_{\mathcal F}(T_0))$.

Proof. The key idea of the proof has two parts. The first part is to show that the constant in Lemma 4.7 has negligible cost, by considering $P^t_{\varepsilon \mathrm{ReLU}(\cdot + M)} \circ P^t_{-\varepsilon \mathrm{ReLU}(\cdot)}$. This provides a translation on $[0, 1]$: $x \mapsto x + (e^\varepsilon - 1)M$. By sending $M \to \infty$ we can construct any translation with negligible time cost.
Now it suffices to resolve a rather simple problem: given the function $u = \ln \varphi'$, on each piece $I'$ of $u$ we can use a piecewise constant function $v|_{I'}$ to approximate $u|_{I'}$ (the restriction to $I'$) such that $\|v\|_{TV[0,1]} \le \|u\|_{TV[0,1]}$ and $\|v - u\|_\infty \le \varepsilon$. Thus we can find a function $\hat\varphi \in \mathcal A_{\mathcal F}((1 - \varepsilon)T_0)$ such that $\ln \hat\varphi' = (1 - \varepsilon)v$. By composing with a translation, we conclude that there exists $\psi \in \mathcal A_{\mathcal F}(T_0)$ approximating $\varphi$ to the desired accuracy. Thus $\varphi \in \mathrm{ac}(\mathcal A_{\mathcal F}(T_0))$.

We highlight an interesting point about the preceding results. Although the time horizon $T$ plays the role of a budget as in traditional approximation theory, the quantitative estimates are very different in flavor from classical results. In particular, Proposition 4.8 shows that for any sufficiently large but finite $T \ge T_0$, the approximation error is actually arbitrarily small. This is in stark contrast to classical approximation results, where one would expect errors of, say, $O(1/T)$ or some other rate that decays with $T$, but where the infimum of errors is never 0 for a finite value of $T$.
Let us now develop the quantitative results a little further for the case where $T$ is not sufficiently large, i.e. $T < \|\ln \varphi'\|_{TV}$. This involves analyzing the error
$$E_T(\varphi) := \inf_{\hat\varphi \in \mathcal A_{\mathcal F}(T)} \|\varphi - \hat\varphi\|_{C([0,1])},$$
which may be non-zero when $T < \|\ln \varphi'\|_{TV}$. Proposition 4.9. $E_T(\varphi)$ is given by the following optimization problem:
$$E_T(\varphi) = \inf_{\psi} \{\|\varphi - \psi\|_{C([0,1])} : \|\ln \psi'\|_{TV} \le T\}. \qquad (57)$$
Notice that the existence of $\ln \psi'$ implies that $\psi$ is a continuously increasing function, so we may consider the above optimization problem only over continuously increasing $\psi$.
It is generally hard to work with optimization problems such as (57), since they involve total variations of logarithms of functions. Below, we formulate a relaxed version. Proposition 4.10. Let $u = \ln \varphi'$ and denote the relaxed optimization problem
$$\tilde E_T(u) := \inf_{v} \{\|u - v\|_\infty : \|v\|_{TV} \le T\}. \qquad (58)$$
Then $E_T(\varphi) \le (e^{\tilde E_T(u)} - 1)(\varphi(1) - \varphi(0))$.

Proof. We choose $v$ such that $\|v\|_{TV} \le T$ and $\|u - v\|_\infty \le \tilde E_T(u) + \varepsilon$. Choose $\psi$ such that $\ln \psi' = v$ and $\psi(0) = \varphi(0)$. Then, since $|\ln \psi' - \ln \varphi'| \le \tilde E_T(u) + \varepsilon$ pointwise, we have
$$|\psi(x) - \varphi(x)| = \Big|\int_0^x (\psi' - \varphi')\Big| \le \int_0^x |e^{v - u} - 1|\, \varphi' \le (e^{\tilde E_T(u) + \varepsilon} - 1)(\varphi(1) - \varphi(0)).$$
Sending $\varepsilon \to 0$, we arrive at the result.
In general, both (57) and (58) are hard to solve. However, for some simple cases of $u$, problem (58) has an explicit solution. For example, if $u$ itself is an increasing function, then the optimal value of (58) is $\tfrac12(\|u\|_{TV} - T)$. If $u$ is increasing on $[0, s]$ and decreasing on $[s, 1]$, then the optimal value of (58) is $\tfrac14(\|u\|_{TV} - T)$. This gives approximation rates for specific cases, but a general investigation of these approximation rates is postponed to future work.
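For the case where $u$ is increasing, the claimed optimal value $\tfrac12(\|u\|_{TV} - T)$ of (58) can be checked numerically by clipping $u$ (a sketch with the illustrative choice $u(x) = x^2$):

```python
import numpy as np

# If u is increasing, the best sup-norm error under a TV budget T is
# (||u||_TV - T)/2, attained by clipping u at levels u(0)+e and u(1)-e.
u = lambda x: x**2                 # increasing on [0, 1]; ||u||_TV = 1
T = 0.4
xs = np.linspace(0.0, 1.0, 10001)
e = (1.0 - T) / 2                  # predicted optimal error, here 0.3
v = np.clip(u(xs), u(0.0) + e, u(1.0) - e)
tv = np.abs(np.diff(v)).sum()
print(tv <= T + 1e-9, np.max(np.abs(v - u(xs))))  # True, approx 0.3
```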

Approximation Results in Higher Dimensions and the Proof of Theorem 2.4
In this section, we generalize the previous results to higher dimensions. The interesting finding is that in higher dimensions, the fact that $\mathcal A_{\mathcal F}$ contains only OP homeomorphisms no longer poses a restriction on approximation in the $L^p$ sense. Moreover, the sufficient condition for universal approximation in higher dimensions is closely related to that in one dimension, where the rearrangement dynamics are driven by well functions. We will prove the following result, which together with Corollary 3.7 implies Theorem 2.4. Proposition 4.11. Let $n \ge 2$. Suppose $\mathcal F$ is restricted affine invariant and $\overline{\mathrm{CH}}(\mathcal F)$ contains a well function. Then for any compact set $K \subset \mathbb{R}^n$, continuous function $\varphi : \mathbb{R}^n \to \mathbb{R}^n$, $p \in [1, \infty)$ and $\varepsilon > 0$, there exists $\psi \in \mathcal A_{\mathcal F}$ such that $\|\varphi - \psi\|_{L^p(K)} \le \varepsilon$. We notice that for the purpose of approximation, the fact that $\overline{\mathrm{CH}}(\mathcal F)$ contains a well function allows us to assume without loss of generality that $\mathcal F$ itself contains a well function. This is due to Proposition 3.10, for we can always replace $\mathcal F$ by $\mathcal G := \mathcal F \cup \{f_{\mathrm{well}}\} \subset \mathrm{CH}(\mathcal F)$, with $f_{\mathrm{well}}$ being a well function in $\mathrm{CH}(\mathcal F)$. Proposition 3.10 says that $\mathrm{ac}(\mathcal A_{\mathcal F}) = \mathrm{ac}(\mathcal A_{\mathcal G})$, hence we can prove approximation results using $\mathcal G$ in place of $\mathcal F$.

Preliminaries
In order to prove Proposition 4.11, we require a few preliminary results, which we state and prove in this subsection. The key approach is similar to the one-dimensional case: we show that we can transform a finite number of distinct source points into a finite number of target points (which are not necessarily distinct). More precisely, we show the following lemma, which generalizes Lemma 4.3 from the one-dimensional case. Lemma 4.12. Suppose $\mathcal F$ contains a well function. Let $\varepsilon > 0$ and $x^1, \cdots, x^m, y^1, \cdots, y^m \in \mathbb{R}^n$ be such that the $\{x^k\}$ are distinct points. Then there exists $\psi \in \mathcal A_{\mathcal F}$ such that $|\psi(x^k) - y^k| \le \varepsilon$ for all $k = 1, \dots, m$.
Lemma 4.12 follows from the combination of the following two lemmas. Lemma 4.13. Suppose $\mathcal F$ contains a well function and $x^1, \cdots, x^m$ are distinct points. Then given any $\varepsilon > 0$, there exists a flow map $\psi \in \mathcal A_{\mathcal F}$ such that $|\psi(x^k) - x^k| \le \varepsilon$ for all $k$ and, for each coordinate index $i$, the values $\{[\psi(x^k)]_i : k = 1, \cdots, m\}$ are $m$ distinct real numbers.

Lemma 4.14. Suppose $\mathcal F$ contains a well function, and $x^1, \cdots, x^m$ are distinct points satisfying the conclusion of Lemma 4.13, that is, $\{x^k_i : k = 1, \cdots, m\}$ are $m$ distinct real numbers for every $i$. Then, given any $\varepsilon > 0$ and $m$ target points $y^1, \cdots, y^m$, there exists $\psi \in \mathcal A_{\mathcal F}$ such that $|\psi(x^k) - y^k| \le \varepsilon$ for all $k$.

Now we prove these two lemmas.
Proof of Lemma 4.13. To prove the lemma it is enough to show that if there is a pair of points $x^j$ and $x^k$ such that $x^j_I = x^k_I$ for some coordinate $I$, then we can find a flow map $\eta \in \mathcal A_{\mathcal F}$ that separates $x^j_I$ and $x^k_I$ and, at the same time, does not cause any other pair of points with initially distinct coordinates to overlap. Without loss of generality, we assume $j = 1$, $k = 2$ and $I = 1$, so we only need to show that if $x^1_1 = x^2_1$, then there exists an $\eta \in \mathcal A_{\mathcal F}$ such that:

1. $|\eta(x^k) - x^k| \le \varepsilon_1$ for all $k$;

2. $[\eta(x^1)]_1 \ne [\eta(x^2)]_1$;

3. for every pair $(j, k)$ and coordinate $i$ with $x^j_i \ne x^k_i$, we still have $[\eta(x^j)]_i \ne [\eta(x^k)]_i$.

We briefly explain these requirements. Consider the set $X_1 = \{(j, k) : j < k,\ x^j_1 = x^k_1\}$ of pairs overlapping in the first coordinate; conditions 2 and 3 imply that $\#X_1 > \#\eta(X_1) \ge 0$, hence $\#X_1$ is strictly decreasing after applying $\eta$, and repeating the procedure over all coordinates proves the lemma. Since $x^1$ and $x^2$ are two distinct points, we can find a coordinate index $I$ ($\ne 1$) such that $x^1_I \ne x^2_I$. Let $f = (f_1, \dots, f_n) \in \mathcal F$ be a well function,
where each $f_i : \mathbb{R}^n \to \mathbb{R}$ is a well function with zero set $\overline{\Omega_i}$. Since $\mathcal F$ is translation invariant, we can assume $\Omega_1$ contains 0 without loss of generality.
Consider the following dynamics, available by restricted affine invariance:
$$\dot z(t) = Df(Az(t) + b), \qquad D = \mathrm{diag}(1, 0, \cdots, 0), \qquad A = \mathrm{diag}(0, \cdots, 0, 1, 0, \cdots, 0),$$
with the single nonzero entry of $A$ in position $I$. In other words, we choose a field $\tilde f(z) = Df(Az + b)$ that moves only the first coordinate, at a rate depending only on the $I$-th coordinate, and we pick $b$ so that $\tilde f_1(x^1) = 0$ while $\tilde f_1(x^2) \ne 0$; the existence of such a $b$ (in particular of the component $b_I$) is implied by the boundedness of $\Omega_1$. We denote by $P^t$ the flow map of this dynamics. We next choose a proper $t$ such that conditions 1, 2, 3 are satisfied. Since $\tilde f_1(x^1) = 0$ and $\tilde f_1(x^2) \ne 0$, we deduce that $[P^t(x^1)]_1 \ne [P^t(x^2)]_1$ whenever $t \ne 0$. Hence condition 2 is satisfied with no additional requirement. Let $d_3 > 0$ be half the minimal gap between coordinate values that are already distinct; notice that when $|P^t(x^k) - x^k| \le \min(\varepsilon_1, d_3)$ for all $k$, both conditions 1 and 3 are satisfied. Since $\mathrm{CH}(\{x^k\})$ is bounded, $\|P^t - \mathrm{id}\|_{C(\mathrm{CH}(\{x^k\}))} \to 0$ as $t \to 0$ by Proposition 3.4. Therefore there exists $t_0 > 0$ such that $\|P^{t_0} - \mathrm{id}\|_{C(\mathrm{CH}(\{x^k\}))} \le \min(\varepsilon_1, d_3)$. Hence we conclude that $\eta = P^{t_0}$ satisfies conditions 1, 2, 3.
Proof of Lemma 4.14. Without loss of generality, we can assume that for each coordinate index $i$, the $\{y^k_i\}$ are $m$ distinct real numbers, since if not, we can always add a small perturbation directly, and this will not affect approximation. We also assume $\Omega_1$ contains the origin, as in the proof of Lemma 4.13.
The basic idea is similar to Lemma 4.13: by choosing a proper linear transformation we can freeze some points while transporting others. Since we need to control more than 2 points, we take multiple transformations and evolve them sequentially. We only need to prove that for any coordinate index $i$ (without loss of generality $i = 1$), we can find an $\eta \in \mathcal A_{\mathcal F}$ with $|[\eta(x^k)]_1 - y^k_1| \le \varepsilon$ for all $k$, leaving the other coordinates unchanged. In other words, we consider the dynamics
$$\dot z(t) = Df(Az(t) + b), \qquad D = \mathrm{diag}(1, 0, \cdots, 0), \qquad A = a\,\mathrm{diag}(0, 1, 0, \cdots, 0),$$
where $a$ is chosen sufficiently small such that all $Ax^k$ lie in $\Omega_1$. We denote the flow map by $P^t(b_2)$, where the dependence on $b_2$ is emphasized. To simplify our notation, we use $P^{-t}(b_2)$ to denote the flow map of $\dot z = -Df(Az + b)$.

Now we wish to choose $b_2$ and the evolution times appropriately. Let $(u_l, u_r)$ be the restriction of $\Omega_1$ to coordinate index 2. By varying $b_2$, the points $x^k$ (distinguished by their second coordinates) can be selectively frozen, namely when $a x^k_2 + b_2$ falls inside $(u_l, u_r)$, or transported in the first coordinate otherwise. The maps $\{\eta^{(k)}\}$ are then defined recursively: $\eta^{(k)}$ transports the first coordinate of $x^k$ to $y^k_1$ while freezing the points already in place, and we set $\eta = \eta^{(m)} \circ \cdots \circ \eta^{(1)}$. We now prove by induction that $\eta^{(k)} \circ \cdots \circ \eta^{(1)}$ matches the first coordinates of $x^1, \cdots, x^k$ to those of $y^1, \cdots, y^k$. By definition we know that $\eta^{(k)}$ matches $x^k$; hence the induction step is proved. From the induction, we conclude that $\eta$ satisfies our requirement.

Proof of Proposition 4.11.
Proof. Since $K$ is compact, by extension it suffices to consider the case where $K$ is a hyper-cube. We can for simplicity take the unit hyper-cube $K = [0, 1]^n$, since the general case is similar. Since $\varphi \in L^p(K)$, by standard approximation theory $\varphi$ can be approximated by piecewise constant functions, i.e. there exists
$$\bar\varphi = \sum_i \varphi_i\, \mathbf{1}_{\square_i}, \qquad \|\varphi - \bar\varphi\|_{L^p(K)} \le \varepsilon,$$
where $\{\square_i\}$ are the $N^n$ sub-cubes of side length $1/N$ forming a grid on $K$; we also denote by $p_i$ the corner grid point of $\square_i$. We also define a shrunken cube $\square^\alpha_i$ (the sub-cube scaled by a factor $0 < \alpha \le 1$ about $p_i$). We have $K = \cup_i \square_i$, and we define $K_\alpha = \cup_i \square^\alpha_i$. We also construct a shrinking function in one dimension $h_\alpha : [0, 1] \to [0, 1]$, such that $h_\alpha(x) = \tfrac{i}{N}$ if $\tfrac{i}{N} \le x \le \tfrac{i+\alpha}{N}$, and $h_\alpha$ is continuous and increasing on $[0, 1]$. Using this, we can form an $n$-dimensional shrinking map by tensor product:
$$H_\alpha(x) = (h_\alpha(x_1), \cdots, h_\alpha(x_n)).$$
The idea of the proof of Proposition 4.11 is quite simple: we first contract each grid cube $\square_i$ approximately to the point $p_i$, then use the lemmas above to transport each $p_i$ approximately to the value $\varphi_i$. The latter step was discussed in the preliminaries; here we construct an "almost" contraction mapping in $\mathcal A_{\mathcal F}$ that approximates $H_\alpha$.
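The one-dimensional shrinking function $h_\alpha$ admits a direct implementation, which may help visualize the contraction step (grid size $N$ and shrinkage $\alpha$ are illustrative):

```python
import numpy as np

# Sketch of the 1D shrinking function h_alpha from the proof: it collapses each
# shrunken sub-interval [i/N, (i+alpha)/N] to the grid point i/N and is
# continuous and (weakly) increasing on [0, 1].
def h_alpha(x, N=4, alpha=0.8):
    i = np.minimum(np.floor(N * np.asarray(x)), N - 1)
    frac = N * np.asarray(x) - i
    ramp = np.maximum(frac - alpha, 0.0) / (1.0 - alpha)  # 0 on [0, alpha], rises to 1
    return (i + ramp) / N

def H_alpha(x, N=4, alpha=0.8):
    # tensor-product shrinking map: apply h_alpha coordinate-wise
    return h_alpha(np.asarray(x), N, alpha)

print(h_alpha(np.array([0.05, 0.15, 0.24, 0.26, 0.99])))  # plateaus at grid points
```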
Claim: For a given tolerance $\varepsilon_1 > 0$, there exists a flow map $H \in \mathcal A_{\mathcal F}$ such that $\|H - H_\alpha\|_{C(K)} \le \varepsilon_1$.

Proof of the Claim. Since $h_\alpha$ is increasing and continuous, we wish to utilize our result in one dimension. Concretely, we demonstrate how to restrict the $n$-dimensional control family to one dimension.
Suppose $\mathcal F$ is an $n$-dimensional control family. For each $f = (f_1, \dots, f_n) \in \mathcal F$, we define the dynamics driven by its restriction to the first coordinate by
$$\dot z_1 = f_1(z_1, 0, \cdots, 0), \qquad \dot z_i = 0 \ \text{ for } i \ge 2,$$
i.e. we take $D = A = \mathrm{diag}(1, 0, \cdots, 0)$. The resulting control family is denoted $\mathcal F_{R,1}$ ($R$ means restriction and 1 means first coordinate). Clearly $\mathcal F_{R,1}$ is closed under composition. Moreover, $\mathcal A_{\mathcal F_{R,1}}$ coincides with the set
$$\{x \mapsto (\psi(x_1), x_2, \cdots, x_n) : \psi \in \mathcal A_{\mathcal G}\},$$
where $\mathcal G$ is a one-dimensional control family, called the restriction control family. Using the aforementioned notation, let $\psi$ transport the $p_i$ to the $\varphi_i$ and let $H$ be the approximate contraction mapping, satisfying the estimates
$$\|H - H_\alpha\|_{C(K)} \le \varepsilon_1, \qquad |\psi(p_i) - \varphi_i| \le \varepsilon_2 \ \text{ for all } i,$$
where $\varepsilon_1$ and $\varepsilon_2$ are to be determined later. Composing $\psi$ with $H$, and choosing $\varepsilon_1$, $\varepsilon_2$ small and $\alpha$ close to 1, then yields $\|\bar\varphi - \psi \circ H\|_{L^p(K)}$ arbitrarily small, completing the proof.
As in the 1D case, Proposition 4.11 together with Corollary 3.7 implies Theorem 2.4.

Approximation Results in Tensor-Product Type Dynamical Systems
Sometimes, we are interested in control families generated by tensor products. Such control families have the advantage that they can be parameterized by scalar functions of one variable, hence allowing for greater flexibility. In this last section, we give some results that apply specifically to tensor-product control families. Let us denote
$$\Pi\mathcal F = \{f(x) = (g(x_1), g(x_2), \cdots, g(x_n)) : g \in \mathcal F\},$$
where $\mathcal F$ is a one-dimensional control family. The main results carry over to the restricted affine invariant family generated by $\Pi\mathcal F$; we sketch the required modifications of the key lemmas.

First, cross-coordinate transport is attainable: if we select $f_1, \cdots, f_S$ sending $x_2$ to $g(x_2)$, we obtain the map $x_1 \mapsto x_1 + g(x_2) - x_2$ in the first coordinate. Also, by setting $D = \mathrm{diag}(1, 0, \cdots, 0)$ and $A_{ij} = \delta_{i2}\delta_{j2}$, we know that the corresponding field is in the generated family. Composing the two parts yields the result.

For the analogue of Lemma 4.13, we prove that if $x^1_1 = x^2_1$ then we can find an $\eta$ that separates them; the three requirements are the same as those established in Lemma 4.13, hence omitted here.

For the analogue of Lemma 4.14, we use $x_2$ to translate $x_1$ (the resulting map is denoted $\eta$). We find two one-dimensional continuously increasing functions $P(\cdot)$ and $Q(\cdot)$ such that $x^k_1 + P(x^k_2) - Q(x^k_2) = y^k_1$ for all $k$. By the assumptions on $\mathcal F$ we can find such $P(\cdot)$ and $Q(\cdot)$ with the corresponding map in the attainable set, and since $|[\eta(x^k)]_1 - y^k_1| \le \varepsilon$, we conclude that $\eta$ satisfies our requirement.
