The Kolmogorov-Arnold representation theorem revisited

There is a longstanding debate whether the Kolmogorov-Arnold representation theorem can explain the use of more than one hidden layer in neural networks. The Kolmogorov-Arnold representation decomposes a multivariate function into an interior and an outer function and therefore indeed has a structure similar to that of a neural network with two hidden layers. But there are distinctive differences. One of the main obstacles is that the outer function depends on the represented function and can be wildly varying even if the represented function is smooth. We derive modifications of the Kolmogorov-Arnold representation that transfer smoothness properties of the represented function to the outer function and can be well approximated by ReLU networks. It appears that, instead of two hidden layers, a more natural interpretation of the Kolmogorov-Arnold representation is that of a deep neural network where most of the layers are required to approximate the interior function.


Introduction
Why are additional hidden layers in a neural network helpful? The Kolmogorov-Arnold representation (KA representation in the following) seems to offer an answer to this question, as it shows that every continuous function can be represented by a specific network with two hidden layers [11]. But this interpretation has been highly disputed. Articles discussing the connection between both concepts have titles such as "Representation properties of networks: Kolmogorov's theorem is irrelevant" [7] and "Kolmogorov's theorem is relevant" [18].

* University of Twente, Drienerlolaan 5, 7522 NB Enschede, The Netherlands. Email: a.j.schmidt-hieber@utwente.nl. The research has been supported by the Dutch STAR network and a Vidi grant from the Dutch science organization (NWO). This work was done while the author was visiting the Simons Institute for the Theory of Computing. The author wants to thank Matus Telgarsky for helpful remarks and for pointing to the article [28].
The original version of the KA representation theorem states that for any continuous function f : [0, 1]^d → R, there exist univariate continuous functions g_q, ψ_{p,q} such that

f(x_1, …, x_d) = Σ_{q=0}^{2d} g_q( Σ_{p=1}^d ψ_{p,q}(x_p) ).   (1.1)

This means that the (2d + 1)(d + 1) univariate functions g_q and ψ_{p,q} are enough for an exact representation of a d-variate function. Kolmogorov published the result in 1957, disproving the statement of Hilbert's 13th problem, which is concerned with the solution of algebraic equations. The earliest proposals in the literature introducing multiple layers in neural networks date back to the sixties, and the link between the KA representation and multilayer neural networks was made much later.
A ridge function is a function of the form f(x) = Σ_{q=1}^m g_q(a_q^⊤ x), where the g_q are univariate functions and a_q ∈ R^d. The structure of the KA representation can therefore be viewed as the composition of two ridge functions. There exists no exact representation of continuous functions by ridge functions, and matching upper and lower bounds for the best approximations are known [20, 19, 8]. The composition structure is thus essential for the KA representation. This raises the question whether the discrepancy between the results indicates that additional hidden layers can lead to new features of neural network functions. A two-hidden-layer feedforward neural network with activation function σ, hidden layers of width m_1 and m_2, and one output unit can be written in the form

x ↦ Σ_{ℓ=1}^{m_2} e_ℓ σ( Σ_{j=1}^{m_1} c_{ℓ,j} σ(a_j^⊤ x + b_j) + d_ℓ ),

with parameters a_j ∈ R^d and b_j, c_{ℓ,j}, d_ℓ, e_ℓ ∈ R.
Since the activation function is given, this class is much smaller than the class of compositions of two ridge functions. However, if σ is continuous and not a polynomial, then the right-hand side of (1.1) can be arbitrarily well approximated by making the network wide, that is, by taking m_1, m_2 large ([24], Proposition 3.7).
There are several reasons why the Kolmogorov-Arnold representation theorem was initially declared "irrelevant" for neural networks in [7]. The original proof of the KA representation in [15] and some later versions are non-constructive and provide very little insight into how the function representation works. Although the ψ_{p,q} are continuous, they are still rough functions sharing similarities with the Cantor function. Meanwhile, more refined KA representation theorems have been derived, strengthening the connection to neural networks. The following KA representation is much more explicit and practical.

Theorem 1 (Theorem 2.14 in [4]). Fix d ≥ 2. There are real numbers a, b_p, c_q and a continuous and monotone function ψ : R → R, such that for any continuous function f : [0, 1]^d → R, there exists a continuous function g : R → R with

f(x_1, …, x_d) = Σ_{q=0}^{2d} g( Σ_{p=1}^d b_p ψ(x_p + qa) + c_q ).

This representation is based on translations of one inner function ψ and one outer function g. The inner function ψ is independent of f. The dependence on q in the first layer comes through the shifts qa. The right-hand side can be realized by a neural network with two hidden layers. The first hidden layer has d units and activation function ψ, and the second hidden layer consists of 2d + 1 units with activation function g.
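To make the structure of this representation concrete, the following Python sketch evaluates a sum of the above form with placeholder callables. Everything here is a hypothetical stand-in: psi, g and the numbers a, b, c are not the actual objects from [4], whose existence the theorem only asserts.

```python
# Structural sketch of the KA representation in Theorem 1 viewed as a
# two-hidden-layer network. All ingredients (psi, g, a, b, c) are
# hypothetical placeholders.

def ka_network(x, psi, g, a, b, c):
    """Evaluate sum_{q=0}^{2d} g( sum_{p=1}^d b[p]*psi(x[p] + q*a) + c[q] ).

    First "layer": units applying the f-independent inner function psi,
    shifted by q*a; second "layer": 2d+1 units applying the outer
    function g, which depends on the represented function f.
    """
    d = len(x)
    return sum(
        g(sum(b[p] * psi(x[p] + q * a) for p in range(d)) + c[q])
        for q in range(2 * d + 1)
    )
```

With psi and g the identity, a = 0, b = (1, …, 1) and c = 0, the output is simply (2d + 1)(x_1 + … + x_d), which checks that the bookkeeping over the 2d + 1 outer units is right.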
For a given 0 < β ≤ 1, we will assume that the represented function f is β-smooth, which here means that there exists a constant C such that |f(x) − f(y)| ≤ C |x − y|_∞^β for all x, y ∈ [0, 1]^d. Let m > 0 be arbitrary. To approximate a β-smooth function up to an error m^{-β}, it is well known that standard approximation schemes need at least of the order of m^d parameters. This means that any efficient neural network construction mimicking the KA representation and approximating β-smooth functions up to error m^{-β} should have at most of the order of m^d network parameters.
Starting from the KA representation, the objective of this article is to derive a deep ReLU network construction that is optimal in terms of the number of parameters. To this end, we first present novel versions of the KA representation that are easy to prove and also allow us to transfer smoothness from the multivariate function to the outer function. In Section 3 the link to deep ReLU networks is made.
The efficiency of the approximating neural network is also the main difference to the related works [17, 22]. Based on sigmoidal activation functions, the proof of Theorem 2 in [17] proposes a neural network construction based on the KA representation with two hidden layers and dm(m + 1) and m^2(m + 1)^d hidden units to achieve an approximation error of the order m^{-β}. This means that more than m^{4+d} network weights are necessary, which is sub-optimal in view of the argument above. The very recent work [22] uses a modern version of the KA representation that guarantees some smoothness of the interior function. Combined with the general result on function approximation by deep ReLU networks in [29], a rate is derived that depends on the smoothness of the outer function via the function class K_C([0, 1]^d; R), see p. 4 in [22] for a definition. The non-trivial dependence of the outer function on the represented function f makes it difficult to derive explicit expressions for the approximation rate if f is β-smooth. Moreover, as this KA representation only guarantees low regularity of the interior function, it remains unclear whether optimal approximation rates can be obtained.

New versions of the KA representation
The starting point of our work is the apparent connection of the KA representation to space-filling curves. A space-filling curve γ is a surjective map γ : [0, 1] → [0, 1]^d. To illustrate our approach, we first derive a simple identity. It avoids the continuity of ψ and g, which is the major technical obstacle in the proof of the KA representation. In this case we can choose γ^{-1} to be an additive function. The proof moreover does not require that the represented function f is continuous.
Lemma 1. Let B ≥ 2 be an integer. There exists a monotone function ψ : R → R such that for any function f : [0, 1]^d → R, we can find a function g : R → R with

f(x_1, …, x_d) = g( Σ_{p=1}^d B^{-p} ψ(x_p) ) for all (x_1, …, x_d) ∈ [0, 1)^d.   (2.2)

Proof. The B-adic representation of a number is not unique. For the decimal representation, 1 is for instance the same as 0.999… To avoid any problems that this may cause, we select for each real number x ∈ [0, 1) the B-adic representation with infinitely many digits different from B − 1. Throughout the following, it is often convenient to rewrite x in its B-adic expansion x = Σ_{ℓ=1}^∞ a_ℓ^x B^{-ℓ} = [0.a_1^x a_2^x …]_B. Set

Ψ(x_1, …, x_d) := Σ_{p=1}^d B^{-p} ψ(x_p)

and define the function ψ(x) := Σ_{ℓ=1}^∞ a_ℓ^x B^{-(ℓ-1)d}. The function ψ is monotone and maps x to a number with B-adic representation inserting always d − 1 zeros between the original digits of x. Multiplication by B^{-p} moreover shifts the digits by p places to the right. From that we obtain the B-adic representation

Ψ(x_1, …, x_d) = [0.a_1^{x_1} a_1^{x_2} … a_1^{x_d} a_2^{x_1} … a_2^{x_d} …]_B.

Since the interleaved digits determine the digits of each x_p, the map Ψ is invertible on its image. We can now define g = f ∘ Ψ^{-1} and this proves the result.
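A finite-precision sketch may clarify the digit-interleaving idea of the proof. The helper names and the truncation to K digits per coordinate are illustrative choices, not part of the lemma; for B = 2 and dyadic inputs the computations below are exact in floating point.

```python
# Sketch of the interleaving map Psi from the proof of Lemma 1, truncated
# to K B-adic digits per coordinate (helper names are illustrative).

B = 2  # base of the B-adic expansion
d = 3  # input dimension
K = 8  # digits kept per coordinate

def digits(x, n, base=B):
    """First n base-adic digits of x in [0, 1)."""
    out = []
    for _ in range(n):
        x *= base
        a = int(x)
        out.append(a)
        x -= a
    return out

def from_digits(a, base=B):
    """Number in [0, 1) with the given base-adic digits."""
    return sum(ai * base ** -(i + 1) for i, ai in enumerate(a))

def Psi(x):
    """Interleave the digits of x_1, ..., x_d into one number."""
    digs = [digits(xp, K) for xp in x]
    interleaved = [digs[p][l] for l in range(K) for p in range(d)]
    return from_digits(interleaved)

def Psi_inv(y):
    """De-interleave: recover (x_1, ..., x_d) from Psi(x)."""
    a = digits(y, K * d)
    return tuple(from_digits(a[p::d]) for p in range(d))
```

Setting g = f ∘ Psi_inv then reproduces the identity f(x) = g(Ψ(x)) on inputs whose expansions terminate within K digits, and makes tangible that nearby inputs share leading interleaved digits.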
The proof provides some insights regarding the structure of the KA representation. Although one might find the construction of Ψ : [0, 1]^d → [0, 1] in the proof very artificial, a substantial amount of neighborhood information is preserved under Ψ: points that are close are often mapped to nearby values. If, for instance, x, y ∈ [0, 1]^d are two points coinciding in all components up to the k-th B-adic digit, then Ψ(x) and Ψ(y) coincide up to the kd-th B-adic digit. In this sense, the KA representation can be viewed as a two-step procedure, where the first step Ψ performs an extreme dimension reduction. Compared to low-dimensional random embeddings, which by the Johnson-Lindenstrauss lemma nearly preserve the Euclidean distances among points, there seems, however, to be no good general characterization of how the interior function changes distances.
The points of discontinuity of Ψ are all points with finite B-adic representation. The map Ψ defines moreover an order relation on [0, 1] d via x < y :⇔ Ψ(x) < Ψ(y). For B = d = 2, the inverse map Ψ −1 is often called the Morton order and coincides, up to a rotation of 90 degrees, with the z-curve in the theory of space-filling curves ([1], Section 7.2).
If f is a piecewise constant function on a dyadic grid, the outer function g is also piecewise constant with the same number of pieces. As a negative result, we show that for this representation, smoothness of f does not translate into smoothness of g.
The discontinuity of the space-filling map Ψ^{-1} causes g to be more irregular than f. Many constructions of space-filling curves are known, but to obtain a representation of KA type, the space-filling curve also needs to be an additive function. The additivity condition rules out most of the canonical choices, such as the Hilbert curve. Below, we use the Lebesgue curve for Ψ^{-1} and show that this leads to a representation that allows us to transfer smoothness properties of f to smoothness properties of g and therefore overcomes the shortcomings of the representation in (2.2). In contrast to the earlier result, g is now a function that maps from the Cantor set, in the following denoted by C, to the real numbers.

Theorem 2. There exists a monotone function φ : [0, 1) → R such that for any function f : [0, 1]^d → R:
(i) there exists a function g : C → R with

f(x_1, …, x_d) = g( Σ_{p=1}^d 3^{-p} φ(x_p) ) for all (x_1, …, x_d) ∈ [0, 1)^d;   (2.4)

(ii) the inverse Φ^{-1} of the interior map Φ(x_1, …, x_d) := Σ_{p=1}^d 3^{-p} φ(x_p) satisfies |Φ^{-1}(x) − Φ^{-1}(y)|_∞ ≤ 2 |x − y|^{log 2/(d log 3)} for all x, y in the image of Φ;
(iii) if f is β-smooth with 0 < β ≤ 1 and constant C, then |g(x) − g(y)| ≤ 2^β C |x − y|^{β log 2/(d log 3)}, that is, g is β log 2/(d log 3)-smooth;
(iv) there exists an infinitely differentiable function f for which the corresponding outer function g is not α-smooth for any α > log 2/(d log 3).
Proof. The construction of φ is similar to that of ψ in the proof of Lemma 1. We associate with each x ∈ [0, 1) its binary representation x = Σ_{ℓ=1}^∞ a_ℓ^x 2^{-ℓ}, selecting the representation with infinitely many digits different from one. The function φ multiplies the binary digits by two (thus, only the values 0 and 2 are possible) and then expresses the digits in a ternary expansion, adding d − 1 zeros between each two digits:

φ(x) := Σ_{ℓ=1}^∞ 2 a_ℓ^x 3^{-(ℓ-1)d}.

Define now

Φ(x_1, …, x_d) := Σ_{p=1}^d 3^{-p} φ(x_p) = [0.(2a_1^{x_1})(2a_1^{x_2}) … (2a_1^{x_d})(2a_2^{x_1}) …]_3,   (2.5)

where the right-hand side is written in the ternary system. Because we can recover the binary representations of x_1, …, x_d from the ternary digits, the map Φ is invertible on its image. Since 2a_ℓ^{x_r} ∈ {0, 2} for all ℓ ≥ 1 and r ∈ {1, …, d}, the image of Φ is contained in the Cantor set. We can now define the inverse Φ^{-1} on the image of Φ and set g = f ∘ Φ^{-1} : C → R, proving (i).
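The following exact-arithmetic sketch (using Python's fractions; all helper names are illustrative and the truncation to K binary digits is an added assumption) implements the finite-precision analogue of φ and Φ and checks that the image indeed consists of ternary expansions with digits 0 and 2 only.

```python
# Sketch of the maps phi / Phi from the proof of Theorem 2, truncated to K
# binary digits per coordinate. Exact rational arithmetic avoids rounding.
from fractions import Fraction

d = 2   # input dimension
K = 10  # binary digits kept per coordinate

def bin_digits(x, n):
    """First n binary digits of x in [0, 1)."""
    x = Fraction(x)
    out = []
    for _ in range(n):
        x *= 2
        a = int(x)
        out.append(a)
        x -= a
    return out

def Phi(x):
    """Double each binary digit (0 -> 0, 1 -> 2) and interleave into one
    ternary expansion; the l-th digit of x_p lands at position (l-1)d + p."""
    digs = [bin_digits(xp, K) for xp in x]
    return sum(2 * digs[p][l] * Fraction(1, 3 ** (l * d + p + 1))
               for l in range(K) for p in range(d))

def ternary_digits(y, n):
    """First n ternary digits of y in [0, 1)."""
    y = Fraction(y)
    out = []
    for _ in range(n):
        y *= 3
        a = int(y)
        out.append(a)
        y -= a
    return out

def Phi_inv(y):
    """Inverse on the image: halve the ternary digits and de-interleave."""
    a = ternary_digits(y, K * d)
    return tuple(
        sum(Fraction(a[l * d + p] // 2, 2 ** (l + 1)) for l in range(K))
        for p in range(d)
    )
```

Composing g = f ∘ Phi_inv then realizes the identity in (i) for inputs with terminating binary expansions.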
In a next step of the proof, we show that

|Φ^{-1}(x) − Φ^{-1}(y)|_∞ ≤ 2 |x − y|^{log 2/(d log 3)} for all x, y in the image of Φ.

For that we extend the proof in [1], p. 98. Observe that Φ^{-1} maps the ternary digits of its argument back to the binary digits of the components x_1, …, x_d. Let k* be such that 3^{-(k*+1)d} ≤ |x − y| < 3^{-k*d}. Suppose that the first k*d ternary digits of x and y are not all the same, and denote by J ≤ k*d the position of the first digit of x that is not the same as that of y. Since only the digits 0 and 2 are possible, the difference between x and y can be lower bounded by |x − y| ≥ 2 · 3^{-J} − 3^{-J}, where the −3^{-J} accounts for the effect of the later digits. Thus |x − y| ≥ 3^{-J} ≥ 3^{-k*d}, a contradiction with |x − y| < 3^{-k*d}. Hence the first k*d ternary digits of x and y coincide. Consequently, the components of Φ^{-1}(x) and Φ^{-1}(y) coincide in their first k* binary digits, so that |Φ^{-1}(x) − Φ^{-1}(y)|_∞ ≤ 2^{-k*} = 2 · 2^{-(k*+1)} ≤ 2 |x − y|^{log 2/(d log 3)}.
To prove (iv), take f(x_1, …, x_d) = x_d and set x_k = 0 and y_k = 2 · 3^{-kd}. Then Φ^{-1}(x_k) = (0, …, 0) and Φ^{-1}(y_k) = (0, …, 0, 2^{-k}). Rewriting this yields |g(x_k) − g(y_k)| = 2^{-k} = (|x_k − y_k|/2)^{log 2/(d log 3)}, so g cannot be α-smooth for any α > log 2/(d log 3). The previous theorem is in a sense more "extreme" than the KA representation, as the univariate interior functions map to a set with Hausdorff dimension log 2/log 3 < 1. A similar construction is used in [28] to prove embeddings of the function spaces generated by circuits into those of neural networks. The advantage of the representation in (2.4) is that smoothness imposed on f translates into smoothness properties of g. A natural question is whether we gain or lose something if, instead of approximating f directly, we use (2.4) and approximate g.
Recall that the approximation rate should be m^{-β} if m^d is the number of free parameters of the approximating function, β the smoothness, and d the dimension. Since g is by (iii) α-smooth with α = β log 2/(d log 3) and is defined on a set with Hausdorff dimension d* = log 2/log 3, we see that there is no loss in terms of approximation rates, since β/d = α/d*. Thus, we can reduce multivariate approximation to univariate approximation on the Cantor set. This, however, only holds for β ≤ 1. Indeed, the last statement of the previous theorem means that even for the infinitely differentiable function f(x_1, …, x_d) = x_d, the outer function g is not more than log 2/(d log 3)-smooth, implying that for higher-order smoothness there seems to be a discrepancy between the multivariate and the univariate function approximation.
The only direct drawback of (2.4) compared to the traditional KA representation is that the interior function φ is not continuous. We will see in Section 3 that φ can, however, be well approximated by a deep neural network.
It is also of interest to study the function class containing all f that are generated by the representation in (2.4) for β-smooth outer function g. Observe that if g(x) = x, then f coincides with the interior function which is discontinuous. This shows that for β ≤ 1, the class of all f of the form (2.4) with g a β log 2/(d log 3)-smooth function on the Cantor set C is strictly larger than the class of β-smooth functions. Interestingly, the function class with Lipschitz continuous outer function g contains all functions that are piecewise constant on a dyadic partition of [0, 1] d .
It is important to realize that the space-filling curves and fractal shapes occur because of the exact identity. It is natural to wonder whether the KA representation leads to an interesting approximation theory. For that, one wants to truncate the number of digits in (2.5), hence reducing the complexity of the interior function. Write φ_K(x) := Σ_{ℓ=1}^K 2 a_ℓ^x 3^{-(ℓ-1)d} for the truncation of φ after the first K binary digits. Because for the KA representation in Theorem 2, smoothness imposed on f induces smoothness of the outer function g, we obtain an approximation bound that is even independent of the dimension d.

Lemma 4. Let 0 < β ≤ 1 and let f : [0, 1]^d → R be β-smooth with constant C. With φ_K and g as above,

|f(x_1, …, x_d) − g( Σ_{p=1}^d 3^{-p} φ_K(x_p) )| ≤ C 2^{-Kβ} for all (x_1, …, x_d) ∈ [0, 1)^d.

Moreover, ‖f‖_∞ = ‖g‖_{L∞(C)}.
Proof. With φ and g as in Theorem 2 and φ_K the truncation of φ after its first K binary digits, we have that

|f(x_1, …, x_d) − g( Σ_{p=1}^d 3^{-p} φ_K(x_p) )| = |f(Φ^{-1}(Φ(x_1, …, x_d))) − f(Φ^{-1}( Σ_{p=1}^d 3^{-p} φ_K(x_p) ))| ≤ C 2^{-Kβ},

where we used (2.4) and that Σ_{p=1}^d 3^{-p} φ_K(x_p) coincides with Φ(x_1, …, x_d) in the first Kd ternary digits, so that the components of the two arguments of f coincide in their first K binary digits and hence differ by at most 2^{-K}. The identity ‖f‖_∞ = ‖g‖_{L∞(C)} follows as an immediate consequence of the function representation in Theorem 2 (i).
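Since Φ^{-1} composed with the truncated interior map simply truncates every coordinate to its first K binary digits, the bound of Lemma 4 can be sanity-checked numerically. The helper names and the test function below are illustrative, not part of the proof.

```python
# Numerical sanity check for Lemma 4 (illustrative): composing g with the
# truncated interior map amounts to evaluating f at coordinates rounded
# down to K binary digits, so the error decays like 2^{-K*beta}.

def trunc(x, K):
    """Keep the first K binary digits of x in [0, 1)."""
    return int(x * 2 ** K) / 2 ** K

def ka_approx(f, x, K):
    """Value of g(sum_p 3^{-p} phi_K(x_p)), i.e. f at the truncated input."""
    return f(*(trunc(xp, K) for xp in x))

def max_error(f, K, grid=200):
    """Maximal error of the truncated KA approximation on a 2-d grid."""
    pts = [(i / grid + 1e-9, j / grid + 1e-9)
           for i in range(grid) for j in range(grid)]
    return max(abs(f(*p) - ka_approx(f, p, K)) for p in pts)
```

For the function f(x_1, x_2) = |x_1 − x_2|, which is β-smooth with β = 1 and constant C = 2 with respect to the sup-norm, the observed error stays below C 2^{-K} and shrinks accordingly when K is increased, independently of the dimension.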

Deep ReLU networks and the KA representation
This section studies the construction of deep ReLU networks imitating the KA approximation in Lemma 4. A deep/multilayer feedforward neural network is a function x ↦ f(x) that can be represented by an acyclic graph with vertices arranged in a finite number of layers. The first layer is called the input layer, the last layer is the output layer, and the layers in between are called hidden layers. We say that a deep network has architecture (L, (p_0, …, p_{L+1})) if the number of hidden layers is L, and p_0, p_j and p_{L+1} are the numbers of vertices in the input layer, the j-th hidden layer and the output layer, respectively. The input layer of vertices represents the input x. For all other layers, each vertex stands for an operation of the form y ↦ σ(a^⊤ y + b), with y the output (viewed as a vector) of the previous layer, a a weight vector, b a shift parameter and σ the activation function. Each vertex has its own set of parameters (a, b), and the activation function also need not be the same for all vertices. If for all vertices in the hidden layers the ReLU activation function σ(x) = max(x, 0) is used and L > 1, the network is called a deep ReLU network. As is common for regression problems, the activation function in the output layer will be the identity.
Approximation properties of deep neural networks for composed functions are studied in [12, 25, 21, 13, 26, 3, 23, 27, 6]. These approaches do, however, not lead to straightforward constructions of ReLU networks exploiting the specific structure of the KA approximation in Lemma 4. To find such a construction, recall that the classical neural network interpretation of the KA representation associates the interior function with the activation function in the first layer [11]. Here, we argue that the interior function can be efficiently approximated by a deep ReLU network. The role of the hidden layers is to retrieve the next bit in the binary representation of the input. Figure 1 gives the construction of a network computing φ_K(x), combining units with linear activation function σ(x) = x and threshold activation function σ(x) = 1(x ≥ 1/2). The main idea is that for x = [0.a_1^x a_2^x …]_2, we can extract the first bit using a_1^x = 1(x ≥ 1/2) = σ(x) and then define 2x − σ(x) = 2(x − a_1^x/2) = [0.a_2^x a_3^x …]_2. Iterating the procedure allows us to extract a_2^x and consequently any further binary digit of x. The deep neural network DNN_I in Figure 1 has K hidden layers and network width three. The left units in the hidden layers successively build the output value; the units in the middle extract the next bit in the binary representation, and the units on the right compute the remainder of the input after bit extraction. To learn the bit extraction algorithm, deep networks obviously lead to much more efficient representations than shallow networks.
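The bit-extraction dynamics behind DNN_I can be simulated directly. The sketch below uses plain Python instead of network weights and illustrative helper names; it assumes the normalization φ_K(x) = Σ_{ℓ=1}^K 2 a_ℓ 3^{-(ℓ-1)d}, chosen so that Σ_q 3^{-q} φ_K(x_q) carries the interleaved ternary digits.

```python
# Sketch of the bit-extraction scheme behind DNN_I: each "layer" reads one
# binary digit with a threshold unit 1(x >= 1/2) and passes the remainder
# on; the accumulator builds phi_K(x) with d-1 ternary zeros between digits.

d = 2  # input dimension; fixes the gap between ternary digit positions

def threshold(x):
    """Threshold unit sigma(x) = 1(x >= 1/2)."""
    return 1.0 if x >= 0.5 else 0.0

def phi_K(x, K):
    """phi_K(x) = sum_{l=1}^K 2 a_l 3^{-(l-1)d} via iterated bit extraction."""
    out, rem = 0.0, x
    for l in range(K):
        a = threshold(rem)       # next binary digit a_{l+1} of x
        out += 2 * a * 3.0 ** (-l * d)
        rem = 2 * rem - a        # shift the remaining digits one place left
    return out
```

For x = 0.625 = [0.101]_2 and d = 2, the loop extracts the digits 1, 0, 1 and returns 2 + 2 · 3^{-4}.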
Constructing d networks computing φ_K(x_p) for each of the inputs x_1, …, x_d and combining them yields a network with K + 1 hidden layers and network width 3d, computing the interior function (x_1, …, x_d) ↦ Σ_{q=1}^d 3^{-q} φ_K(x_q) of the KA approximation

f(x_1, …, x_d) ≈ g( Σ_{q=1}^d 3^{-q} φ_K(x_q) ).   (3.1)

The overall number of non-zero parameters is of the order Kd. To approximate a β-smooth function f by a neural network via the KA approximation (3.1), the interior step makes the approximating network deep but uses only very few parameters compared to the approximation of the univariate function g.
A close inspection of the network DNN_I in Figure 1 shows that all linear activation functions get non-negative input and can therefore be replaced by the ReLU activation function without changing the outcome. The threshold activation function σ(x) = 1(x ≥ 1/2) can be arbitrarily well approximated by the linear combination of two ReLU units via

1(x ≥ 1/2) ≈ 2^r (x − 1/2 + 2^{-r-1})_+ − 2^r (x − 1/2 − 2^{-r-1})_+ for large r.

If one accepts potentially huge network parameters, the network DNN_I in Figure 1 can therefore be approximated by a deep ReLU network with K hidden layers and network width four. Consequently, also the construction in (3.1) can be arbitrarily well approximated by deep ReLU networks.
Throughout the following, we write ‖f‖_p := ‖f‖_{L^p([0,1]^d)}.
(A): Let r be the largest integer such that 2^r ≤ 2K 2^{Kβp} and set

S_1(x) := 2^r (x − 1/2 + 2^{-r-1})_+ − 2^r (x − 1/2 − 2^{-r-1})_+ and T_1(x) := 2x.

Given S_j(x), T_j(x), we can then define S_{j+1}(x) := S_1(T_j(x) − S_j(x)) and T_{j+1}(x) := 2(T_j(x) − S_j(x)). There exists a ReLU network with architecture (1, (1, 2, 1)) and all network weights bounded in absolute value by 2^r computing the function x ↦ S_1(x). Similarly, there exists a ReLU network with architecture (1, (2, 2, 1)) computing (S_j(x), T_j(x)) ↦ S_{j+1}(x) = S_1(T_j(x) − S_j(x)). Since S_1(x), x ≥ 0, we have that (S_j(x))_+ = S_j(x) and T_j(x) = (T_j(x))_+. Because of that, we can now concatenate these networks as illustrated in Figure 1 to construct a deep ReLU network computing φ_K(x). Each of the maps (S_j(x), T_j(x)) ↦ S_{j+1}(x) requires an extra layer with two nodes that is not shown in Figure 1. Thus, any arrow, except for the ones pointing to the output, adds one additional hidden layer to the ReLU network. The overall number of hidden layers is thus 2K. Because of the two additional nodes in the non-displayed hidden layers, the width in all hidden layers is four and thus the overall architecture of this deep ReLU network is (2K, (1, 4, …, 4, 1)). By checking all edges, it can be seen that all network weights are bounded by 2^r ≤ 2K 2^{Kβp}.
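The pair of ReLU units forming S_1 can be checked numerically; the following minimal sketch (illustrative function names) shows that S_1 agrees with the threshold 1(x ≥ 1/2) outside a window of width 2^{-r} around 1/2.

```python
# Sketch of step (A): two ReLU units with slope 2^r reproduce the
# threshold 1(x >= 1/2) exactly outside a window of width 2^{-r}.

def relu(x):
    return max(x, 0.0)

def S1(x, r):
    """S_1(x) = 2^r (x - 1/2 + 2^{-r-1})_+ - 2^r (x - 1/2 - 2^{-r-1})_+."""
    slope = 2.0 ** r
    half_window = 2.0 ** (-r - 1)
    return slope * relu(x - 0.5 + half_window) - slope * relu(x - 0.5 - half_window)
```

S1 is 0 to the left of 1/2 − 2^{-r-1}, 1 to the right of 1/2 + 2^{-r-1}, and ramps linearly in between; increasing r sharpens the ramp at the price of weights of size 2^r, which is the trade-off appearing in the weight bound of the proof.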
The function g* can therefore be represented on [0, 1] by a shallow ReLU network with 2^{Kd} + 1 units in the hidden layer. Moreover, g*(x_j) = g(x_j) for all j = 0, …, 2^{Kd}. Finally, we bound the size of the network weights. We have x_{j+1} − x_j ≥ 3^{-Kd}. By Lemma 4, ‖f‖_∞ = ‖g‖_{L∞(C)}. Since 0 ≤ x_j ≤ 1 and, for any positive a, a(x − x_j)_+ = √a(√a x − √a x_j)_+, we conclude that all network weights can be chosen to be smaller than 2‖f‖_∞ 2^{Kd}.

(D): Figure 2 shows how the neural networks for φ_K and g* can be combined into a deep ReLU network with architecture (2K + 3, (d, 4d, …, 4d, d, 1, 2^{Kd} + 1, 1)) and all network weights bounded in absolute value by max(2‖f‖_∞ 2^{Kd}, 2K 2^{Kβp}), computing the function f*(x_1, …, x_d) := g*( Σ_{q=1}^d 3^{-q} φ_K(x_q) ). From (B) and the interpolation property of g*, we conclude that f*(x_1, …, x_d) = g( Σ_{q=1}^d 3^{-q} φ_K(x_q) ). As shown in Lemma 4, ‖f‖_∞ = ‖g‖_{L∞(C)}. Since g* is a piecewise linear interpolation of g, we also have ‖g*‖_{L∞([0,1])} ≤ ‖f‖_∞. Decomposing the integral and using the approximation bound in Lemma 4 yields the claimed bound, using for the last inequality β ≤ 1 and a^p + b^p ≤ (a + b)^p for all p ≥ 1 and all a, b ≥ 0.
Recall that for a function class with m^d parameters, the expected optimal approximation rate for a β-smooth function in d dimensions is m^{-β}. The previous theorem leads to the rate 2^{-Kβ} using of the order of 2^{Kd} network parameters. This thus coincides with the expected rate. In contrast to several other constructions, no network sparsity is required to recover the rate. It is unclear whether the construction can be generalized to higher-order smoothness.
Recall that the interior function extracts bits from the input. The fact that deep networks can do bit encoding and decoding efficiently has been used to prove (nearly) sharp bounds on the VC dimension of deep ReLU networks in [2], and also, via a different construction, to obtain approximation rates for very deep networks with fixed width; see [29].
The function approximation in Lemma 4 is quite similar to tree-based methods in statistical learning. CART or MARS, for instance, select a partition of the input space by making successive splits along different directions and then fit a piecewise constant function on the selected partition ([9], Section 9.2). The KA approximation is also piecewise constant, and the interior function assigns a unique value to each set in the dyadic partition. Enlarging K refines the partition. The deep ReLU network constructed in the proof of Theorem 3 imitates the KA approximation and also relies on a dyadic partition of the input space. By changing the network parameters in the first layers, the unit cube [0, 1]^d can be split into more general subsets, and function systems similar to the ones underlying MARS or CART can be generated using deep ReLU networks; see also [5, 14].
A key observation in the construction of the deep ReLU network in Theorem 3 is that only the weights in the last hidden layer depend on the represented function f. In deep learning it has been observed that the first layers build function systems which can be reused for other classification problems. This is exploited in pre-training, where a trained deep network from a different classification problem is taken and only the last layer is learned on the new dataset; see for instance [30]. The fact that pre-training works shows that deep networks build rather generic function systems in the first layers. For real datasets, the learned parameters in the first hidden layers still exhibit some dependence on the underlying problem, and transfer learning, which updates all weights based on the new data, outperforms pre-training [10].