On the Number of Regions of Piecewise Linear Neural Networks

Many feedforward neural networks (NNs) generate continuous and piecewise-linear (CPWL) mappings. Specifically, they partition the input domain into regions on which the mapping is affine. The number of these so-called linear regions offers a natural metric to characterize the expressiveness of CPWL NNs. The precise determination of this quantity is often out of reach in practice, and bounds have been proposed for specific architectures, including for ReLU and Maxout NNs. In this work, we generalize these bounds to NNs with arbitrary and possibly multivariate CPWL activation functions. We first provide upper and lower bounds on the maximal number of linear regions of a CPWL NN given its depth, width, and the number of linear regions of its activation functions. Our results rely on the combinatorial structure of convex partitions and confirm the distinctive role of depth which, on its own, is able to exponentially increase the number of regions. We then introduce a complementary stochastic framework to estimate the average number of linear regions produced by a CPWL NN. Under reasonable assumptions, the expected density of linear regions along any 1D path is bounded by the product of depth, width, and a measure of activation complexity (up to a scaling factor). This yields an identical role to the three sources of expressiveness: no exponential growth with depth is observed anymore.


Introduction
The ability to train deep parametric models has enabled dramatic advances in a wide variety of fields, ranging from computer vision to natural-language processing [1,2]. Many popular deep models belong to the family of feedforward neural networks (NNs), for which the input-output mapping takes the form $f = (\sigma_L \circ f_{\theta_L}) \circ \cdots \circ (\sigma_1 \circ f_{\theta_1})$, where L is the number of layers of the NN (referred to as the depth of the NN), $f_{\theta_k} : \mathbb{R}^{d_k} \to \mathbb{R}^{d_{k+1}}$ is an affine function parameterized by θ_k, and σ_k is a non-affine activation function. One of the most widespread activation functions in deep learning is the rectified linear unit ReLU(x) = max(x, 0) [3,4,5]. With this choice, the mapping is a composition of continuous and piecewise-linear (CPWL) functions, which yields a map that is CPWL too [6]. Remarkably, the converse also holds true: any CPWL function R^d → R can be parameterized by a ReLU NN with at most ⌈log_2(d + 1)⌉ hidden layers [7]. The family of NNs generating CPWL functions (referred to as CPWL NNs in the sequel) is broad. It benefits from a large choice of effective activation functions that includes ReLU [5], leaky ReLU [4], PReLU [8], CReLU [9], Maxout [10], linear splines [11,12], GroupSort [13], Householder [14], as well as other components such as convolutional layers, max- and average-pooling, skip connections [15], and batch normalization [16] (once the model is trained). While the depth of the architecture is instrumental to overcome the curse of dimensionality [17,18,19], it concurrently hinders our understanding of the parameterization when compared to simpler models [20].
The observation that a ReLU NN produces a CPWL function sheds light on its behavior. In effect, a ReLU NN partitions the input domain into affine regions [21,22]. The characteristics of the regions are therefore fundamental to grasp the structure of the learnt mapping, and there exist different approaches to define them [23,24]. The regions can be described as polyhedrons or unions of polyhedrons, which results from the continuity and the piecewise-affine property of the mapping. In the case of ReLU NNs, it is common to define activation regions, which are sets of points that fire the same group of neurons. On each activation region, the mapping is affine, and these sets are convex [25]. Unfortunately, the linear regions in deep NNs are only indirectly specified. While they can be locally described [26], their global delimitation becomes computationally less and less tractable as the dimension increases, which compromises the interpretability of deep NNs. Yet, this region structure is entangled with the ability of deep NNs to overcome the curse of dimensionality.
The successive compositions inherent in deep models prevent us from attributing a specific role to each parameter. The size and the expressiveness of the function space H_N generated by a given architecture N is consequently only loosely connected to the number of trainable parameters. With their remarkable structure, CPWL NNs benefit from another meaningful descriptor: the distribution of the counts of regions over all the mappings that the architecture can produce. Two approaches have been proposed to give a better understanding of this descriptor.
• Upper and lower bound the maximum number of regions of the CPWL mappings generated by a given architecture. The first bounds, given in [6], showed that the maximum number of regions that can be produced by ReLU NNs increases exponentially with their depth. This revealed that deep models have the ability to generate much more complex functions than shallow ones do. The bounds for ReLU NNs have since been refined, for example in [27] and then in [28], and also extended to other NN architectures. For instance, [29] specifies bounds for the maximum number of regions of convolutional NNs (CNNs). It is shown that CNNs produce more regions per parameter than fully connected NNs do. For Maxout NNs, bounds can be derived directly from the ones on ReLU NNs [6,27]. However, this approach usually yields loose bounds, as recently shown in [30]. The derivation of sharp bounds for Maxout NNs, as proposed in [30], requires taking into account the specificities of the Maxout unit, which was handled via the use of tropical geometry.
The available bounds show that the maximum number of regions in ReLU and Maxout NNs increases exponentially with their depth. This suggests that deep models have the ability to generate more complex functions than shallow ones do [6,27,7,28,30].
• Upper bound the average number of regions of the mappings generated by ReLU and Maxout NNs. The available bound for ReLU NNs depends on the number of neurons, regardless of whether the NN is deep or wide, and depth does not produce exponentially more regions on average [31,25]. In other words, this behavior drastically differs from the maximum number of regions. This new perspective was then recently extended to Maxout NNs, with a similar qualitative conclusion [32].
The existing toolbox of CPWL NNs is broad and likely not complete yet, as hinted by recent works on the MaxMin or, more generally, GroupSort activation function in the field of Lipschitz-constrained NNs [13,33], or with the Piecewise Linear Unit (PWLU) [34]. Previous studies on the count of linear regions have provided insights on some specific CPWL NNs only, mostly ReLU and Maxout NNs. Their qualitative outcomes turn out to hold true for CPWL NNs in general. We intend to prove this claim with quantitative results in this paper. We want to improve the understanding of the role of the three main ways to increase the expressiveness of CPWL NNs (Figure 1), namely,
• depth, which is the number of composed CPWL functions;
• width, which relates the input and output dimensions of the composed CPWL layers;
• activation complexity, the rationale there being that the expressiveness of a CPWL NN can be heightened by increasing the complexity of its activation functions. This strategy is used with both univariate and multivariate activation functions, and it gave rise to deepspline, Maxout, GroupSort, and PWLU NNs, for example.
In the remainder of the paper, the complexity of an activation will refer to its number of linear regions. For example, a rank-k Maxout unit has a complexity of k (see Figure 6 for visual examples). Our contributions are as follows.
(i) Generalization of the notion of arrangement of hyperplanes to arrangement of convex partitions with analogous tight bounds on the number of regions.
(ii) Determination of precise bounds on the maximal number of linear convex regions generated by the primary operations of the space of CPWL functions (sum, vectorization, and composition). The compositional upper and lower bounds grow exponentially with depth and polynomially with the width and the activation complexity.
(iii) Demonstration that, under reasonable assumptions, the expected number of regions along a 1D path for random CPWL NNs is at most linear with the product of the depth, the width, and the activation complexity (up to an independent factor), which yields equivalent roles to the three descriptors in terms of expressiveness.
The paper is organized as follows: In Section 2, we present the relevant mathematical concepts. In Section 3, we bound from below and from above the maximal number of regions produced by CPWL NNs and, in Section 4, we present a stochastic framework to quantify the average expressiveness of CPWL NNs with random parameters.
2 Mathematical Preliminaries

Definition 1 (CPWL function). A function f : R^d → R^{d'} is continuous and piecewise-linear (CPWL) if it is continuous and if there exist finitely many affine functions f_1, …, f_K and closed sets Ω_1, …, Ω_K covering R^d such that f = f_k on Ω_k for k = 1, …, K.

The f_k are called the affine pieces of f, and the Ω_k the corresponding projection regions.
An example of a CPWL function and of its partition is given in Figure 2. The kth component of a vector-valued CPWL function f_ℓ, which is necessarily CPWL as well, will be denoted by f_{ℓ,k}. The space of CPWL functions has the following remarkable properties:
• it is closed under compatible compositions;
• it is closed under compatible linear combinations;
• it is closed under compatible vectorization.
Since the function x ↦ max(x) = max(x_1, …, x_d) is CPWL (with d regions), the space of CPWL functions is also closed under max-pooling.

Regions of CPWL Functions and Convex Partitions
The term linear region is frequently used in an ambiguous way and may refer to different mathematical definitions. In the sequel, we briefly present some relevant definitions and discuss them in the context of CPWL NNs.

Projection Regions
We recall that a polyhedron is the intersection of finitely many half-spaces, and that a polytope is a bounded polyhedron. The subsets Ω_k in Definition 1 are commonly referred to as projection regions [23,24]. The affine pieces of different projection regions are distinct and, since the overall function is continuous, the common points of two neighboring regions lie in a hyperplane. This implies that the Ω_k are polyhedrons or unions of polyhedrons. These projection regions might, however, not be connected (Figure 3).

Convex Regions
It is usually preferred to work with (connected) convex regions because of their simpler geometrical structure.We now precisely define convex linear regions of CPWL functions.
Definition 2 (Convex partitions of R^d, adapted from [35]). Let n and d be two positive integers. A convex partition of R^d is a collection Π = (P_1, P_2, …, P_n) of convex and closed subsets of R^d with nonempty and pairwise-disjoint interiors so that the union $P_1 \cup P_2 \cup \cdots \cup P_n = \mathbb{R}^d$. Convex partitions with n regions are called n-partitions.

Definition 3 (Linear convex partition). A linear convex partition of a CPWL function f : R^d → R^{d'} is a convex partition Π of R^d such that f is affine on each region of Π.

The existence of a linear convex partition is guaranteed for any CPWL function, but not its uniqueness. This motivates Definition 4, which gives a precise meaning to the number of convex linear regions of CPWL functions.

Definition 4 (Number of convex linear regions). The number κ_f of convex linear regions of f is defined as the minimal cardinality over all linear convex partitions of f.

A special instance of the linear convex regions for scalar-valued CPWL functions are the uniquely-ordered regions. Each of these regions has the same ordering of the values of the affine pieces f_k of f at all its points [24]. Uniquely-ordered regions are used to build the lattice representation of a CPWL function [23] and are tightly connected to the GroupSort activation function [13].

Projection vs Convex Linear Regions
In the remainder of the paper we shall keep in mind the following connections between projection and convex linear regions.
• Projection regions can always be partitioned into convex regions so that any upper bound on the number of convex regions also applies to the number of projection regions. Conversely, the number of convex regions can also be upper bounded by the number of projection regions (Proposition 1).
• Most commonly used parameterizations typically have convex projection regions. The local parameterization with hat basis functions produces simplicial linear splines whose natural regions are simplices [20] and, therefore, are convex. Other known linear expansions, such as the generalized hinging-hyperplanes model [36], use nonlocal CPWL basis functions that partition the input domain into convex regions. The generated function will produce projection regions that are convex for all sets of parameters except for some specific values that are usually encountered with zero probability in a learning framework. Convex regions are also naturally adapted to compositional models such as ReLU and Maxout NNs, as explained in [31].
Proposition 1. Let f : R^d → R^{d'} be a CPWL function with ρ projection regions. The number κ of linear convex regions of f is no larger than the number of convex regions formed by the arrangement of M = ρ(ρ − 1)/2 hyperplanes, that is, $\kappa \le \sum_{k=0}^{\min(d,M)} \binom{M}{k}$.

The proof of Proposition 1 is given in A.

Useful Properties of Convex Partitions
We now give a series of lemmas on convex partitions that are used in the proofs of Section 3. The proofs are given in A.
For convenience, we extend the definition of convex partitions of R d to convex partitions of affine subspaces of R d .In particular, a convex partition of an affine subspace E of R d consists of convex and closed subsets of E of dimension dim(E) whose pairwise intersection is of dimension smaller than dim(E) and whose union is E.
Lemma 1 (Projection of a convex partition).Let E be an affine subspace of R d and Π an n-partition of R d .Then, there exists a convex partition Π E of E in R d with no more than n regions such that, for P E ∈ Π E , there is P ∈ Π with P E ⊂ P .
Lemma 2 (Preimage of a convex partition under affine maps). Let f : R^d → R^{d'} be an affine function and Π be an N-partition of the affine space f(R^d). Then, the preimages under f of the regions of Π form a convex partition of R^d with no more than N regions.

Figure 4: Arrangement of two convex partitions of R^2.

Arrangement of Convex Partitions
The known results on the number of convex regions of ReLU NNs are built upon the theory of hyperplane arrangements. In combinatorial geometry, an arrangement of hyperplanes refers to a set of hyperplanes. It is known that the number of connected regions formed by an arrangement of N hyperplanes in R^d is at most $\sum_{k=0}^{\min(d,N)} \binom{N}{k}$ [37]. This bound is reached when the hyperplanes are in general position: any collection of k of them intersects in a (d − k)-dimensional plane for 1 ≤ k ≤ d and has an empty intersection for k > d. Although this positioning seems very specific, it is qualified as "general" because it almost surely happens when the hyperplanes are randomly generated (with a "reasonable" notion of randomness). When it comes to the study of generic CPWL NNs, the concept of arrangement of hyperplanes lacks precision since only a small fraction of all convex partitions can be seen as arrangements of hyperplanes. We thus introduce the notion of arrangement of convex partitions (Definition 5 and Figure 4) as a generalization, which will prove to be necessary to find the precise bounds given in Section 3. Note that, in the case of an arrangement of N hyperplanes, our terminology differs. Instead of considering the hyperplanes, we rather consider the N 2-partitions they form, which consist of pairs of closed half-spaces separated by the hyperplanes.
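As a quick numerical companion, the hyperplane-arrangement count $\sum_{k=0}^{\min(d,N)} \binom{N}{k}$ can be tabulated and checked against elementary cases (a minimal sketch; the helper name is ours):

```python
from math import comb

def max_regions_hyperplanes(N: int, d: int) -> int:
    """Maximum number of regions cut out of R^d by N hyperplanes:
    sum_{k=0}^{min(d,N)} C(N, k)."""
    return sum(comb(N, k) for k in range(min(d, N) + 1))

# Sanity checks against elementary cases:
# - 5 points split the real line into 6 intervals;
# - 3 generic lines split the plane into 7 regions;
# - N <= d hyperplanes in general position give 2^N regions.
print(max_regions_hyperplanes(5, 1))  # 6
print(max_regions_hyperplanes(3, 2))  # 7
print(max_regions_hyperplanes(2, 3))  # 4
```

The `min(d, N)` truncation is precisely what distinguishes the high-dimensional regime (N ≤ d, count 2^N) from the low-dimensional one, where the ambient dimension caps the growth.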
Definition 5 (Arrangement of convex partitions). Let Π_1, …, Π_N be convex partitions of R^d with Π_i = (P_{i,1}, …, P_{i,n_i}). The arrangement A(Π_1, …, Π_N) of these partitions is the convex partition whose regions are the A_{m_1,…,m_N} that have nonempty interiors, where $A_{m_1,\dots,m_N} = P_{1,m_1} \cap \cdots \cap P_{N,m_N}$ for m_i ∈ {1, …, n_i}.

Maximum Number of Regions Produced by CPWL NNs
In this section, we characterize the largest number of regions that can be generated by simple operations with CPWL functions, including sums, vectorizations, and compositions. In particular, we strictly generalize the known upper and lower bounds on the number of regions of ReLU NNs [38] and Maxout NNs [30] to NNs activated by generic CPWL activation functions.

Upper Bound on the Number of Regions of Arrangements
Operations with CPWL functions imply arrangements of convex partitions, either explicitly, for sums and vectorizations, or implicitly, for compositions. It is straightforward to see that an arrangement A(Π_1, …, Π_N) of N convex partitions Π_1, …, Π_N of R^d with n_1, …, n_N regions cannot yield more than n_1 n_2 ⋯ n_N regions. This naive bound is a polynomial of degree N in n_1, …, n_N. In dimension d = 1 one can, however, check that the bound is not sharp: the number of regions is no more than 1 + (n_1 − 1) + ⋯ + (n_N − 1). More generally, the number of regions of the arrangement is bounded by a polynomial in the cardinalities n_1, …, n_N of the partitions of degree min(d, N) (Theorem 1), which highlights the role played by the dimension of the ambient space.
Theorem 1 (Arrangements' upper bound). The maximum cardinality β_d(n_1, …, n_N) of the arrangement A(Π_1, …, Π_N) of N convex partitions Π_1, …, Π_N of R^d with cardinalities n_1, …, n_N is a polynomial in n_1, …, n_N of degree min(d, N). It is given by

$\beta_d(n_1, \dots, n_N) = \sum_{S \subseteq \{1,\dots,N\},\, |S| \le d}\ \prod_{i \in S} (n_i - 1)$.

Moreover, this bound is sharp. The expression of the bound in Theorem 1 is based on a broad result of discrete geometry [39]. We then relied on Zaslavsky's Theorem [37] and Whitney's formula to construct a specific arrangement of convex partitions for any set of parameters d, N, n_1, …, n_N ∈ N∖{0} that achieves the bound. The proof of Theorem 1 is given in A. We now discuss the result and its implications.
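Under one reading of the bound that is consistent with both the hyperplane case (all n_i = 2) and the 1D count 1 + Σ(n_i − 1) discussed above — namely, a sum over index subsets S of {1, …, N} with |S| ≤ d of Π_{i∈S}(n_i − 1) — the quantity β_d can be evaluated directly (a sketch; the function name and the assumed closed form are ours):

```python
from itertools import combinations
from math import prod

def beta(d, ns):
    """Hypothesized arrangement bound: sum over index subsets S with
    |S| <= d of prod_{i in S} (n_i - 1).  For n_i = 2 this reduces to the
    hyperplane-arrangement count; for d = 1, to 1 + sum(n_i - 1)."""
    N = len(ns)
    return sum(
        prod(ns[i] - 1 for i in S)
        for k in range(min(d, N) + 1)
        for S in combinations(range(N), k)
    )

print(beta(2, [2, 2, 2]))  # 7: three generic lines in the plane
print(beta(1, [3, 4]))     # 6 = 1 + 2 + 3: partitions of the real line
print(beta(5, [3, 4]))     # 12 = 3 * 4: naive product when N <= d
```

The last call illustrates the degree statement: when N ≤ d, every subset is admissible and the polynomial factors back into the naive product n_1 ⋯ n_N.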
• Theorem 1 is a generalization of the hyperplane-arrangement bound. Indeed, let us consider the number of regions generated by an arrangement of N hyperplanes: each hyperplane defines a 2-partition of R^d and the bound yields $\beta_d(2, \dots, 2) = \sum_{k=0}^{\min(d,N)} \binom{N}{k}$, which is known to be exactly the number of convex regions generated by an arrangement of N hyperplanes in general position [37].
• The naive upper bound can be rewritten as $n_1 \cdots n_N = \prod_{i=1}^{N} \big(1 + (n_i - 1)\big) = \sum_{S \subseteq \{1,\dots,N\}} \prod_{i \in S} (n_i - 1)$, a sum over all index subsets S. This shows that, when N ≤ d, the naive bound is optimal, since all subsets then have cardinality at most d. By contrast, when N > d, the dimension enforces the existence of one or more empty intersections between regions of different partitions. This is illustrated in Figure 5 with a simple example.
• For partitions with the same number n of regions, we introduce the simpler notation $\beta_d^N(n) := \beta_d(n, \dots, n)$.
• The bound is reached when the partitions Π_k are made of the regions of the arrangement of (n_k − 1) distinct parallel hyperplanes, where the hyperplanes are in general position when only one per partition is selected (more details in the proof in A).
Remark 1. After the disclosure of our work on arXiv, we became aware of [30], which contains highly relevant results on the complexity of Maxout NNs. From Theorem 1, we can directly recover the sharp bound on the number of regions of a shallow Maxout NN recently given in [30, Theorem 3.7]. Regarding the converse, i.e., inferring Theorem 1 from [30, Theorem 3.7], we believe that it could perhaps be done, but it is not immediate. Indeed, [30, Theorem 3.7] is specific to convex partitions that are the linear partitions of the maximum of affine functions, i.e., only specific CPWL functions. Note that the proofs in [30] rely on tropical geometry, which gives an overall perspective very different from ours.

Single Hidden-Layer: Bound for the Sum and Vectorization Operations
The sum and vectorization of CPWL functions both yield the same bound on the number of linear convex regions. We now give a novel optimal bound in Proposition 2. The proof is given in A.
Proposition 2. Let f_1, …, f_N : R^d → R be CPWL functions with κ_1, …, κ_N convex linear regions. The number of convex linear regions of the sum f_1 + ⋯ + f_N and of the vector-valued function (f_1, …, f_N) can be bounded by a polynomial in κ_1, …, κ_N of degree min(d, N), namely by $\beta_d(\kappa_1, \dots, \kappa_N)$, and these bounds are sharp.
Remark 2. Bounds similar to the ones given in Proposition 2 have recently been derived for one-hidden-layer Maxout NNs [30]. The latter work is a specific instance of our setting, in which the CPWL functions considered are the maximum of a finite set of affine functions.
As an illustration of Proposition 2 and Theorem 1, we give some direct implications on the number of regions of some building blocks of CPWL NNs before going deeper.

Ridge Functions Consider the ridge expansion $f_R : x \mapsto \sum_{k=1}^{N} \varphi_k(w_k^T x + b_k)$, where each profile φ_k : R → R is a CPWL function with κ_k linear regions. Each term partitions R^d into κ_k slabs delimited by parallel hyperplanes so that, by Proposition 2, $\kappa_R \le \beta_d(\kappa_1, \dots, \kappa_N)$, and the bound is tight.

Max-Pooling
The kth component of the max-pooling operation is $x \mapsto \max_{i \in I_k}(x_i)$, where I_k is a set of chosen cardinality N of "neighboring" coordinate indices. Each component has N convex linear regions, so that the number κ_mp of convex linear regions of the max-pooling operation is upper-bounded, by Proposition 2, as $\kappa_{mp} \le \beta_d(N, \dots, N)$, with one N-partition per output component.

Generalized Hinging Hyperplanes (GHH) Consider the GHH expansion $f_G : x \mapsto \sum_{k=1}^{K} \epsilon_k \max_{p}\big(f_k^p(x)\big)$, where the f_k^p are affine functions and ε_k = ±1 [36]. If the kth max runs over m_k affine functions, it has at most m_k convex linear regions, and the number κ_GHH of convex linear regions of f_G is upper-bounded as $\kappa_{GHH} \le \beta_d(m_1, \dots, m_K)$.

GroupSort Layer The sort operation takes as input a vector x ∈ R^d and simply sorts its components.
For any permutation σ of the set {1, …, d}, we define the uniquely-ordered region $\{x \in \mathbb{R}^d : x_{\sigma(1)} \le \cdots \le x_{\sigma(d)}\}$, where x_k is the kth component of x. These regions are convex as intersections of half-spaces, and the sort operation agrees on them with distinct affine functions, namely, permutations. We infer that the sort operation has exactly d! linear convex regions and the same number of projection regions. The GroupSort activation was recently introduced and shown to be beneficial in the context of Lipschitz-constrained learning [13]. It generalizes the MaxMin and sort activations: it splits the pre-activation into a chosen number n_g of groups of size g_s (with n_g g_s = d), sorts each pre-activation of each group in ascending order, and outputs the combined sorted groups. Each group produces g_s! linear convex regions which are invariant along the coordinates that are not in the group. We infer the number of linear convex regions of the GroupSort activation to be κ_GS = (g_s!)^{n_g}, which can be bounded as $(g_s/2)^{d/2} \le \kappa_{GS} \le g_s^d$, where we have used the known inequalities (n/2)^{n/2} ≤ n! ≤ n^n. The bounds support the intuition that larger group sizes generate more regions than smaller ones. Note, however, that they simultaneously increase the computational complexity of the layer.
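The count κ_GS = (g_s!)^{n_g} and the surrounding inequalities can be checked numerically (a small sketch; the function name is ours):

```python
from math import factorial

def groupsort_regions(d: int, gs: int) -> int:
    """Number of convex linear regions of a GroupSort activation on R^d
    with groups of size gs (hence d/gs groups): (gs!)^(d/gs)."""
    assert d % gs == 0, "group size must divide the dimension"
    return factorial(gs) ** (d // gs)

d, gs = 8, 4
kappa = groupsort_regions(d, gs)   # (4!)^2 = 576
# Bounds derived from (n/2)^(n/2) <= n! <= n^n:
lower = (gs / 2) ** (d / 2)        # 2^4  = 16
upper = gs ** d                    # 4^8  = 65536
print(kappa, lower <= kappa <= upper)  # 576 True
```

Keeping d fixed and raising g_s from 2 to d interpolates between the MaxMin count 2^{d/2} and the full sort count d!, which matches the stated intuition about group size.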
PWLU The PWLU [34] is a learnable CPWL activation function with control points placed on a grid and with fixed linear regions (namely, simplices whose vertices are control points). In its 2D version, a PWLU φ_PWLU : R^2 → R with M^2 control points has 2(M − 1)^2 linear regions that are triangles; see Figure 6 for an illustration with M = 4, and see [34, Figure 5] for a more generic representation of PWLUs. Consider a one-hidden-layer NN whose activation applies a 2D PWLU to pairs of pre-activations. Each PWLU contributes a convex partition with 2(M − 1)^2 regions so that, by Proposition 2, the number κ_PWLU of convex linear regions of this PWLU NN is upper-bounded as $\kappa_{PWLU} \le \beta_d\big(2(M-1)^2, \dots, 2(M-1)^2\big)$. Our framework also allows one to derive bounds for NNs activated with higher-dimensional PWLUs, but we are not aware of their use in practice.
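The triangle count of a 2D PWLU follows directly from the grid structure: (M − 1)^2 grid cells, each split into two triangles. A one-line check (the helper name is ours):

```python
def pwlu2d_regions(M: int) -> int:
    """Linear regions of a 2D PWLU with an M x M grid of control points:
    (M - 1)^2 grid cells, each split into 2 triangles."""
    return 2 * (M - 1) ** 2

print(pwlu2d_regions(4))  # 18, as in the M = 4 illustration of Figure 6
```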

Multiple Hidden-Layers: Compositional Bounds
The architecture of a CPWL NN R^{d_1} → R^{d_{L+1}} is specified by its depth L, its layer dimensions (d_1, …, d_{L+1}), and its activation complexity κ_{ℓ,k} at each node (ℓ, k), which is naturally given by the number of linear convex regions of the kth component of the ℓth composed function (Figure 6). Theorem 2 below yields precise bounds on the maximal number of convex linear regions of any CPWL NN. It is complemented by Corollary 1, which tackles the following question: given a CPWL NN with fixed input and output dimensions, how is the maximal number of regions related to depth, width, and activation complexities? Our results confirm and generalize the following qualitative intuitions:
• (i) depth can exponentially increase the complexity of the generated function;
• (ii) width and activation complexity, on the contrary, can only increase the number of linear convex regions of the generated function polynomially;
• (iii) layers with small dimensions reduce the maximal number of regions produced by the NN, especially if they are located toward the input of the NN. This bottleneck effect stems from the upper bound given in Theorem 1.
Note that (i) is well known and was first proven in [6], (ii) is in agreement with the recent results in [30] obtained for the particular instance of Maxout NNs, and (iii) was observed for ReLU NNs in [40,27].
where β_•(•) is the upper bound on the number of regions of an arrangement of convex partitions (Theorem 1). There, T_ℓ^d denotes the set of all mappings from {k ∈ N : …}.

Corollary 2. The bounds given in Theorem 2 and Corollary 1 also apply to the maximal number of projection regions of a CPWL NN and, equivalently, to its maximal number of distinct affine pieces.
The proofs of Theorem 2 and of its corollaries can be found in A.4.

Application to Some Popular CPWL NNs
In the sequel, we consider the CPWL NN $f = f_L \circ \cdots \circ f_1$ with layers $f_\ell : \mathbb{R}^{d_\ell} \to \mathbb{R}^{d_{\ell+1}}$. We now apply Theorem 2 to bound the maximal number of convex linear regions produced by the most popular architectures. Note that the lower bound given in Theorem 2 only applies to CPWL NNs with pointwise activation functions. This includes ReLU and, more generally, deepspline NNs. The reason is that the lower bound of Theorem 2 was found by building a deepspline NN.
ReLU/PReLU/Leaky ReLU NNs In a ReLU NN, the kth component f_{ℓ,k} of f_ℓ takes the form f_{ℓ,k} : x ↦ ReLU(w_{ℓ,k} x + b_{ℓ,k}) and has two convex linear regions (half-spaces). Theorem 2 then yields $\kappa \le \prod_{\ell=1}^{L} \beta_{d_\ell}(2, \dots, 2)$, with d_{ℓ+1} 2-partitions in the ℓth factor, which is the bound proposed in [6]. However, it is not the tightest upper bound known [27]. The reason is that the ReLU function is only a very specific instance of 1D CPWL functions with 2 linear regions: the image of the half real line (−∞, 0] by the ReLU function is only the singleton {0}. This reduces the apparent dimension of the problem for any region that does not fire all neurons. This observation was exploited in [27] to get a better estimate. In that sense, (18) is better tailored to PReLU and Leaky ReLU NNs, which have activations with two nonzero-slope regions.
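As an illustration (a sketch under the assumption that the ℓth factor is the hyperplane-arrangement count for d_{ℓ+1} 2-partitions in ambient dimension d_ℓ; the function name is ours), the layer-wise product bound can be tabulated to compare shallow and deep architectures with a similar neuron budget:

```python
from math import comb

def relu_region_bound(dims):
    """Product, over layers, of the hyperplane-arrangement count
    sum_{k <= min(d_in, d_out)} C(d_out, k); dims = (d_1, ..., d_{L+1}).
    A coarse compositional bound in the spirit of the text (assumed form)."""
    bound = 1
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        bound *= sum(comb(d_out, k) for k in range(min(d_in, d_out) + 1))
    return bound

# Same number of hidden neurons (8), shallow vs deep:
print(relu_region_bound((2, 8)))     # 37 = 1 + 8 + 28
print(relu_region_bound((2, 4, 4)))  # 176 = 11 * 16
```

Even in this toy comparison, stacking layers multiplies the per-layer factors, which is the mechanism behind the exponential growth with depth discussed in Section 3.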
GroupSort NNs To bound the number of linear convex regions of a GroupSort NN [13] with the same group size g_s in each layer, we consider for each composition the arrangement of d_{ℓ+1}/g_s convex partitions (one per group) with g_s! regions each, and obtain that $\kappa \le \prod_{\ell=1}^{L} \beta_{d_\ell}(g_s!, \dots, g_s!)$.

These bounds provide an intuition of the role of the hyperparameters of CPWL NNs in terms of expressiveness.
For instance, the number of units in Maxout NNs plays a role in the bound that is analogous to that of the number of knots of the activation functions in deepspline NNs. However, these two architectures do not induce the same implementation complexity. To increase the activation complexity by one unit, Maxout requires the inclusion of an additional learnable multidimensional affine function, whereas deepspline simply requires the insertion of one more knot into a 1D CPWL function.
While it is tempting to compare architectures on the sole basis of their expressiveness, it can be very delicate to draw generic practical conclusions from this comparison. The final choice of an architecture is guided by a tradeoff between expressiveness, computational complexity, memory usage, and the ability to learn over the functional space. For instance, an increase in the group size of a GroupSort activation function increases the expressiveness with no additional parameters, but small group sizes are usually favored to keep the computational impact limited.

Expected Number of Regions Produced by CPWL NNs Along 1D Paths
In Section 3, we found that depth increases the expressiveness of the model exponentially when the corresponding metric is the maximal number of regions. However, the compositions that achieve the lower bound of Theorem 2 could be very specific and hard to reach in practice.
The composition (f_2 ∘ f_1) of two CPWL functions results in the partitioning of each linear region of f_1 into smaller linear pieces. The successive compositions (f_ℓ ∘ ⋯ ∘ f_1) have regions that are obtained from splitting the regions of the previous compositions (Figure 7). As such, we expect the image of each region of the composition to shrink when depth increases, at least for compositions with reasonable gradient magnitudes (∼ 1). The extent of the split should therefore depend on the depth of the composition. The more regions the first compositions produce, the fewer splits each region will undergo after the next compositions. This intuition rules out an exponential growth of the average number of regions with ℓ. This effect has already been revealed for ReLU NNs in [25] and recently extended to Maxout NNs in [32]. We now aim to prove that it is universal to NNs with any type of CPWL activation under reasonable assumptions.
Figure 7: Linear region-splitting process for a CPWL NN with absolute-value activation function and randomly generated parameters.The figure shows the linear regions of the mapping after k activation layers, for k = 0, . . ., 9. From one layer to the next, the regions are partitioned into smaller pieces.The number of linear regions is indicated in parentheses and suggests that the splitting process saturates with depth.The regions were numerically identified by evaluating the Jacobian of the mapping on a very fine grid.
Throughout this section, we consider CPWL functions f_θ parameterized by random parameters θ. We shall specify the parameterization and characteristics of the underlying stochastic model whenever needed. The natural extension of Section 3 is to estimate the expected number of regions of compositions of randomly generated CPWL functions. This task is unfortunately very complex, as it mixes stochasticity and combinatorial geometry. It would involve an overly heavy framework, at the risk of losing focus on the high-level intuition. Instead, we propose a simpler but closely related metric: the expected density of regions along 1D paths. This quantity is valuable in practice since it gives the expected number of linear regions that are found in-between two locations of the input space that are 1 unit distance apart. In addition, the inverse of the density gives a rough measure of the average size of a linear region along one direction.
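The grid-based identification used for Figure 7 can be mimicked along a 1D path: sample the path finely, compute numerical directional slopes, and count slope changes. A sketch with an absolute-value NN (the architecture, initialization, seed, and thresholds are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# A small random NN with absolute-value activations (CPWL).
dims = [2, 8, 8, 1]
Ws = [rng.normal(0.0, 1.0 / np.sqrt(m), (n, m))
      for m, n in zip(dims[:-1], dims[1:])]
bs = [rng.normal(0.0, 0.5, n) for n in dims[1:]]

def net(X):
    """Evaluate the NN on a batch of points X of shape (T, 2)."""
    H = X
    for W, c in zip(Ws[:-1], bs[:-1]):
        H = np.abs(H @ W.T + c)
    return (H @ Ws[-1].T + bs[-1]).ravel()

# Sample the segment p0 -> p1 on a fine grid and detect slope changes.
p0, p1 = np.array([-2.0, -2.0]), np.array([2.0, 2.0])
ts = np.linspace(0.0, 1.0, 20001)
vals = net(p0[None, :] + ts[:, None] * (p1 - p0)[None, :])
slopes = np.diff(vals) / np.diff(ts)
# A knot lying strictly between two samples produces a short run of
# consecutive slope changes; count maximal runs as single knots.
chg = np.abs(np.diff(slopes)) > 1e-6
knots = int(np.sum(chg & ~np.r_[False, chg[:-1]]))
print("linear pieces along the path:", knots + 1)
```

The same slope-sampling idea underlies the path-based density studied below: refining the grid makes the count stabilize once every knot is isolated between consecutive samples.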

Knot Density
The characterization of the density of linear regions of CPWL NNs along one-dimensional paths requires the introduction of some mathematical concepts.
1D CPWL Path A 1D CPWL path denotes in the sequel any CPWL mapping γ : S → R^d on a closed segment S = [a, b] (a, b ∈ R) with finitely many knots. This path will serve to "navigate" within the input domain of CPWL NNs for counting the linear regions. The length of γ is computed as $\mathrm{Len}(\gamma) := \int_{S} \big\| \tfrac{\mathrm{d}\gamma}{\mathrm{d}t} \big\|_2 \,\mathrm{d}t$. Note that γ is a parameterization of what is often referred to as a polygonal chain. In this section, we only study the density of linear regions along CPWL paths because of their simplicity and their connections with CPWL NNs; for example, the composition of a CPWL path and a CPWL NN is again a CPWL path. This choice is, however, not very restrictive since CPWL paths can approximate any continuous path arbitrarily closely.
Knot Density Along a Path Given a 1D CPWL path, the goal is to characterize the complexity of a CPWL NN along it. Informally, the number of knots of a CPWL NN along the path is the number of times the path crosses regions. This intuitive definition is unfortunately not sufficiently precise since it does not specify how to count knots when some nonzero-length portion of the path γ is contained in a face of a linear region; see Figure 8 for an example. To avoid any ambiguity, we introduce the characteristic function of a CPWL f along γ,

$\varphi_f^\gamma : t \mapsto \big(\mathbb{1}_{\Omega_1}(\gamma(t)), \dots, \mathbb{1}_{\Omega_K}(\gamma(t))\big)$,

where the sets Ω_k are the projection regions of f, $\mathbb{1}_{\Omega_k}(\gamma(t)) = 1$ if γ(t) ∈ Ω_k, and $\mathbb{1}_{\Omega_k}(\gamma(t)) = 0$ otherwise. Since γ is continuous with finitely many knots, and since the projection regions are unions of polyhedrons, φ_f^γ is a binary function with finitely many jumps; see Figure 8. Note that φ_f^γ uniquely identifies the supporting affine function active at location γ(t). Hence, in practice, the knowledge of f(γ(t)) and ∇f(γ(t)), which is computable in any deep-learning library, suffices to identify φ_f^γ(t).
We stress that alternative definitions of the knot density that correspond to the same informal intuition are possible, but they would differ when the path γ follows the boundaries of some projection regions. In the sequel, this will not matter since, in any reasonable stochastic framework, the path almost surely does not follow any boundary.
The knot density along a path is subadditive under the sum and the vectorization of CPWL functions, and it can be bounded for the composition of CPWL functions; see Propositions 3 and 4, with the corresponding proofs in Appendix B.
Proposition 3. Let γ : S → R^d be a 1D CPWL path on the segment S ⊂ R and let f_1 : R^d → R^{d′} and f_2 : R^d → R^{d′} be two CPWL functions. The knot density along γ of either the sum f_1 + f_2 or the vectorized function (f_1, f_2) is bounded as λ^γ_{f_1+f_2}, λ^γ_{(f_1,f_2)} ≤ λ^γ_{f_1} + λ^γ_{f_2}, where λ^γ_{f_1} and λ^γ_{f_2} are the knot densities of f_1 and f_2 along γ, respectively.
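The subadditivity of Proposition 3 can be probed numerically on toy 1D examples by counting slope changes of sampled CPWL functions; a sketch (the helper `knots_1d` and its tolerance are our choices):

```python
import numpy as np

def knots_1d(f, a, b, n=100001, tol=1e-9):
    """Count the slope changes of a scalar CPWL function sampled on [a, b]."""
    t = np.linspace(a, b, n)
    slopes = np.diff(f(t)) / np.diff(t)
    return int(np.sum(np.abs(np.diff(slopes)) > tol))

f1 = lambda x: np.abs(x)                 # 1 knot (at 0)
f2 = lambda x: np.abs(x - 1.0)           # 1 knot (at 1)
s = lambda x: f1(x) + f2(x)              # knots at 0 and 1

k1, k2, ks = (knots_1d(g, -2.0, 2.0) for g in (f1, f2, s))
print(k1, k2, ks)                        # 1 1 2
assert ks <= k1 + k2                     # subadditivity of the knot count
```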
Proposition 4. Let γ : S → R^{d_1} be a 1D CPWL path on S ⊂ R and let f_1 : R^{d_1} → R^{d_2} and f_2 : R^{d_2} → R^{d_3} be two CPWL functions. Then, the knot density of f_2 ∘ f_1 along γ is bounded as λ^γ_{f_2∘f_1} ≤ λ^γ_1 + (Len(f_1 ∘ γ)/Len(γ)) λ^γ_2, where λ^γ_1 is the knot density of f_1 along γ and λ^γ_2 that of f_2 along f_1 ∘ γ.

Knot Density of CPWL Layers
The goal of this subsection is to show that the knot density is well behaved for classical CPWL NN layers, which justifies Assumption (i) of Theorem 3 and Corollary 3. The proofs can be found in Appendix C.
In particular, when b and the components of w are normally distributed with zero mean and standard deviations σ_b and σ_w, respectively, the tighter bound E[λ^γ_f] ≤ σ_w/(π σ_b) holds true. When the ReLU activation function is replaced by a 1D CPWL function with a given number K of knots, we conjecture that the bounds can simply be multiplied by K.
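In the Gaussian case, combining sup_t ρ_b(t) = (σ_b √(2π))^{-1} with E[|w^T u|] = σ_w √2/√π (both computed in the proof of Proposition 5) gives the value σ_w/(π σ_b), which can be checked by Monte Carlo simulation. A sketch (all parameter values are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma_w, sigma_b = 5, 1.0, 1.0     # dimension and scales: arbitrary choices
length, trials = 2.0, 200_000

x0 = rng.standard_normal(d)           # base point of the segment
u = rng.standard_normal(d)
u /= np.linalg.norm(u)                # unit direction

W = rng.normal(0.0, sigma_w, size=(trials, d))
b = rng.normal(0.0, sigma_b, size=trials)
t0 = -(W @ x0 + b) / (W @ u)          # crossing time of the hyperplane w.x + b = 0
knot = (t0 >= 0.0) & (t0 <= length)   # a knot occurs iff the crossing lies in S

density = knot.mean() / length
bound = sigma_w / (np.pi * sigma_b)   # our reading of the Gaussian bound
print(density, bound)                 # the estimate stays below the bound
```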
Proposition 6 (Knot density – Maxout). Let ((w_{k1}, …, w_{kd}), b_k) ∈ R^d × R for k = 1, …, K be independent random variables with bounded probability density functions ρ_b for any b_k and ρ_w for all components w_{kl} of w_k, which are i.i.d. over both k ∈ [K] and l ∈ [d]. Then, the expected knot density of the rank-K Maxout unit f : x ↦ max_{k=1,…,K}(w_k^T x + b_k) along any 1D CPWL path γ is bounded as follows, where σ_w is the standard deviation of any w_{kl}. In particular, when b_k and w_{kl} are normally distributed with zero mean and standard deviations σ_b and σ_w, respectively, a tighter bound holds true. The bounds provided in Proposition 6 grow quadratically with the rank K of the Maxout unit; we conjecture the existence of a tighter linear bound.
Proposition 7 (Knot density – GroupSort). Let (w_k, b_k) be as in Proposition 6. Then, the expected knot density λ^γ_f of the GroupSort layer f : R^d → R^d : x ↦ GS_{n_g,g_s}(W x), where GS_{n_g,g_s} is the GroupSort activation with n_g groups of size g_s, is bounded along any 1D CPWL path γ in terms of σ_w, the standard deviation of any w_{kl}. In particular, when b_k and w_{kl} are normally distributed with zero mean and standard deviations σ_b and σ_w, respectively, a tighter bound can be given. For ReLU and Maxout layers with multidimensional outputs, the bounds given in Propositions 5 and 6 are simply multiplied by the output dimension (see Proposition 3). We note that all proposed bounds take the form (κ W σ_w sup_{t∈R} ρ_b(t)), where the prefactor κ only depends on the activation function and W is the number of outputs of the layer. The learnable parameters are typically initialized by sampling a uniform or normal distribution with the same characteristics for the biases and the weights of a same layer. In this case, although the characteristics of the distribution usually depend on the input and output dimensions of the layer [8], the quantity σ_w sup_{t∈R} ρ_b(t) is determined only by the type of distribution, normal or uniform (since, for these distributions, the supremum of the probability density function is inversely proportional to the standard deviation). All in all, it should be kept in mind that
• the expected knot density is well defined for learnable CPWL layers;
• with standard initialization methods, it is reasonable to assume that the expected knot density of the components of a CPWL layer depends neither on its width nor on the total depth of the NN (at least at the initialization stage).
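The scale-invariance of σ_w sup_t ρ_b(t) when σ_b = σ_w is elementary to verify; a small sketch (the helper names are ours):

```python
import math

def sup_pdf_normal(sigma):
    """Peak of the N(0, sigma^2) density."""
    return 1.0 / (sigma * math.sqrt(2.0 * math.pi))

def sup_pdf_uniform(sigma):
    """Peak of the Uniform(-a, a) density with standard deviation sigma."""
    a = sigma * math.sqrt(3.0)  # std of Uniform(-a, a) is a / sqrt(3)
    return 1.0 / (2.0 * a)

# With sigma_b = sigma_w = sigma, the product sigma * sup rho_b is scale-free
# and depends only on the type of distribution:
for sigma in (0.01, 1.0, 100.0):
    print(sigma * sup_pdf_normal(sigma),   # 1/sqrt(2*pi) ~ 0.399 every time
          sigma * sup_pdf_uniform(sigma))  # 1/(2*sqrt(3)) ~ 0.289 every time
```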
It is tempting to take advantage of the previous results to adjust the distributions of the weights and biases at initialization in the hope of increasing the upper bound and, possibly, the knot density of a NN. The effect is, however, subtle: for instance, if one narrows the distribution of the biases, the bound increases since sup_{t∈R} ρ_b(t) increases. While this may increase the average knot density at some specific locations, it will inevitably decrease it elsewhere.

Bounds on the Expected Knot Density of CPWL NNs
In Theorem 3 and Corollary 3, we introduce two different settings to bound the expected knot density of CPWL NNs. Theorem 3 highlights the role played by the gradients of the composed layers: larger gradients allow for a more intense splitting process within the composition and should lead to a greater knot density. With Corollary 3, we propose a more practical analysis: given a learning task that dictates the input and output dimensions, how does the expected density of linear regions along 1D curves relate to the depth, width, and activation complexity of the CPWL NN? In accordance with the intuition given in Figure 7, depth cannot provide exponentially more linear regions on average. This key result relies mainly on Assumption (ii), which is discussed in Section 4.3.1.
The directional derivative of the function f in the direction u is denoted by D_u{f}. The proofs of the results presented in this section can be found in Appendix D.
Then, on any 1D CPWL path γ, the expected knot density of the CPWL NN is bounded as follows, where D_0^* = max(D_0, 1).
The proof of Theorem 3 relies on Lemma 4. In the bound presented in this lemma, the expected value is evaluated before taking the supremum; switching the order of these two operators would yield a much looser bound.

Lemma 4. Let γ : S → R^d be a 1D CPWL path and f_θ : R^d → R^{d′} a CPWL function parameterized by the random variable θ such that, for any x, u ∈ R^d, f_θ is differentiable at x in the direction u with probability 1. Then, the expected length of the 1D CPWL path f_θ ∘ γ is bounded as E[Len(f_θ ∘ γ)] ≤ sup_{x∈R^d, ∥u∥_2=1} E[∥D_u{f_θ}(x)∥_2] Len(γ).

Discussion of the Compositional Bounds
Our approach relies on the independence of the randomly generated CPWL functions. This usually holds at the initialization stage but no longer during learning. While this can be regarded as a limitation, it is a legitimate and convenient way to explore and depict the whole function space that a given architecture gives access to. Assumption (i) of Theorem 3 and its corollary (bounded expected knot density of the learnable CPWL components) has been discussed in detail in Section 4.2, where it was remarked that it is reasonable to assume that λ_0 is independent of W and L.
Theorem 3 and Corollary 3 differ in Assumption (ii) (well-behaved gradients). While the assumption of the theorem seems more natural at first sight (gradient controlled for each layer), that of the corollary is closer to practical observations. Assumption (ii) of Corollary 3 was invoked to bound the expected length of the image of any finite-length 1D CPWL path, independently of the depth of the composition. While early works suggested that this expected length grows exponentially with depth [42], it was recently shown otherwise in a more realistic setup, both theoretically and experimentally [43]. For instance, for ReLU NNs with the usual 2/fan-in weight variance, depth typically does not affect the expected length [43]. More generally, a control of the magnitude of the directional derivatives that is independent of the depth is highly desirable in the learning stage for a stable back-propagation algorithm [8] and, in the inference stage, to produce robust models [44]. In short, it is also reasonable to assume that the parameter D_0 depends neither on W nor on L.
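The absence of exponential length growth under the 2/fan-in initialization can be probed empirically; a rough Monte Carlo sketch (the width, depth, bias scale, and sampling resolution are arbitrary choices of ours, and the sampled length only approximates Len):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu_net_path_length(width, depth, n_pts=2000):
    """Approximate length of the image of a unit segment through a random
    ReLU net with 2/fan-in weight variance (He initialization)."""
    t = np.linspace(0.0, 1.0, n_pts)
    u = rng.standard_normal(width)
    u /= np.linalg.norm(u)
    x = np.outer(t, u)  # a straight unit-length path in R^width
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
        b = rng.normal(0.0, 0.1, size=width)
        x = np.maximum(x @ W.T + b, 0.0)
    # Length of the (CPWL) image path, approximated on the sample points
    return float(np.sum(np.linalg.norm(np.diff(x, axis=0), axis=1)))

for depth in (1, 5, 20):
    lengths = [relu_net_path_length(width=100, depth=depth) for _ in range(10)]
    print(depth, float(np.mean(lengths)))  # stays O(1): no exponential growth
```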
The previous discussion suggests a simple and important bound on the density of regions of CPWL NNs. It attributes an identical role to the three sources of complexity, namely depth, width, and activation complexity.
The quality of the proposed bounds seems to be completely determined by the tightness of the bounds in Assumptions (i) and (ii). Based on the proofs of Theorem 3 and Corollary 3, we believe that the compositional bounds are sharp provided that the expected knot density is uniform (i.e., the same for any 1D CPWL curve) and that the expected norm of the directional derivative is uniform and isotropic within the NN.

Conclusion
In this work, we have investigated the role of depth, width, and activation complexity in the expressiveness of CPWL NNs. By invoking results from combinatorial geometry, we have found that depth has a predominant role over width and activation complexity: it is the only descriptor able to increase the number of linear regions exponentially. However, this exponential growth is only observed for the maximal number of regions. Indeed, when exploring the whole function space produced by a given CPWL NN, we have found that, on average, the number of regions along a line is bounded by a quantity that only depends on the product of the three descriptors. From that perspective, the three complexity parameters play an identical role: no exponential behavior with depth is observed anymore.
The ability to train deeper and deeper NNs has led to major improvements in machine learning. However, depth comes at a price in applications where the NN needs to be stable, for instance by constraining its global Lipschitz constant. In such settings, we therefore believe that complex learnable activations should always be regarded as a valuable opportunity to substantially increase the expressiveness of the model without resorting to deeper NNs [45,46].

Appendices

A Proofs for Section 3

A.1 Number of Convex vs Projection Regions
Proof of Proposition 1. The first inequality follows from the fact that there cannot be fewer linear convex regions than affine pieces. Consider two neighboring projection regions Ω_k and Ω_p of f, where 1 ≤ k < p ≤ ρ, with corresponding affine pieces f_k and f_p: the set {x ∈ R^d : f_k(x) = f_p(x)} is a hyperplane. The arrangement of these at most ρ(ρ − 1)/2 hyperplanes yields convex regions on which f is affine, since these regions do not contain boundary points. The number of such regions is, therefore, an upper bound on the number of convex regions of f. It is known from [37] that the number of convex regions formed by an arrangement of N hyperplanes in R^d is at most Σ_{k=0}^{min(d,N)} (N choose k). Hence, for ρ(ρ − 1)/2 > d, we directly reach the announced result. Otherwise, the bound yields 2^{ρ(ρ−1)/2}.

Proof of Lemma 1. Let e = dim(E). The natural candidate for Π_E is the partition Π′ = {P ∩ E : P ∈ Π}, which is unfortunately not necessarily a proper convex partition. Indeed, if E contains an e-face of a region, then some elements of Π′ will not have disjoint interiors. Since the regions of Π are polyhedra, there exists a number n_H of distinct boundary hyperplanes H_p = {x ∈ R^d : a_p^T x + b_p = 0} such that, for each P_k ∈ Π, there exists a subset I_k ⊂ {1, …, n_H} of indices of the hyperplanes that support the boundary of P_k. We now consider a mapping ϕ that assigns to each hyperplane H_p a unique region ϕ(p) such that p ∈ I_{ϕ(p)}.
We can now define n new pairwise-disjoint convex regions P′_1, …, P′_n. It is clear that ∪_{k=1}^n P′_k = R^d. From these new regions, one can eventually build the proper convex partition Π_E = {P′_k ∩ E : P′_k ∩ E ≠ ∅}. By construction, all regions of Π_E are closed with nonempty interiors, and their union covers E. Let P_{E,1} = P′_{k_1} ∩ E and P_{E,2} = P′_{k_2} ∩ E be two (nonempty) regions of Π_E. We have that Int(P_{E,1}) ∩ Int(P_{E,2}) = ∅. We have therefore proved that Π_E is a convex partition of E; it has at most n regions and is such that, for any P_E ∈ Π_E, there is P ∈ Π with P_E ⊂ P.
Proof of Lemma 2. Let P ∈ Π. Recall that P is a closed and convex subset of the affine space f(R^d) with dimension dim(f(R^d)). We first prove that f^{-1}(P) meets the requirements to form a convex partition of R^d.
• The continuity of f implies that f^{-1}(P) is closed.
• We decompose the input space as the direct sum R^d = U ⊕ ker(A), where A is the linear part of f. It is clear that, for any x ∈ f^{-1}(P) and y ∈ ker(A), we have that x + y ∈ f^{-1}(P), which implies that dim(Proj_{ker(A)}(f^{-1}(P))) = dim(ker(A)). In addition, we use the fact that f restricted to U has full rank and write dim(Proj_U(f^{-1}(P))) = dim(U). All in all, we have proved that dim(f^{-1}(P)) = d, which implies that the regions of f^{-1}(Π) have nonempty interiors.
Proof of Lemma 3. The result stems from the fact that the rank of a product of matrices is bounded by the smallest of the ranks of its factors.

A.2 Upper Bound on the Number of Regions of Arrangements
Proof of Theorem 1. First, we prove that the expression given in the theorem is an upper bound. To that end, we formalize our problem with the notion of an abstract simplicial complex so as to focus solely on the combinatorial structure of the task and to comply with the formalism of [39]. Let Π*_k = {int(P) : P ∈ Π_k}, where int(P) denotes the interior of P in R^d, and let F = ∪_{k=1}^N {P* : P* ∈ Π*_k} be the set that contains the elements of the N sets Π*_k. The nerve K of F is made of all the nonempty intersections of sets in any of the Π*_k. The nerve of an open covering is an abstract simplicial complex, which therefore applies to K since F is an open covering of R^d. This also follows directly from the definition of an abstract simplicial complex: a family of sets that is closed under taking subsets. In the sequel, we need K to be a d-representable simplicial complex, which is granted because it is the nerve of a finite family of convex sets in R^d (more details in [39]). In our problem, the faces of dimension 0 of the complex, also known as vertices, are the elements of F. More generally, a face of K of dimension p is a nonempty intersection of (p + 1) elements of F. The dimension of the sub-complex K[Π*_k], which is the largest dimension of its faces, is 0 because the elements of Π*_k are disjoint. We note that the interiors of the regions of the arrangement of the convex partitions are (N − 1)-dimensional faces of the abstract simplicial complex K, also called 1-colorful faces, where 1 = (1, …, 1) ∈ R^N specifies that each region of the arrangement is built from one region per partition. We are therefore looking to bound the number f_1(K) of 1-colorful faces of the complex K. Since we have now fully translated our problem into the framework of [39], we can apply [39, Theorem 10] to F. The parameter r = (r_1, …, r_N) can be chosen so that dim(K[Π*_k]) ≤ (r_k − 1); we therefore simply choose r = 1, which proves that the bound given in the theorem holds true.

Now we show that this upper bound is sharp. To that end, consider that each partition Π_k is made of the regions of the arrangement of (n_k − 1) distinct parallel hyperplanes H^k_q for q = 1, …, (n_k − 1), chosen so that the hyperplanes are in general position when only one per partition is selected. Recall that N hyperplanes are in general position if any collection of k of them intersects in a (d − k)-dimensional plane for 1 ≤ k ≤ d and has empty intersection for k > d. The number of regions of the arrangement A(Π_1, …, Π_N) is exactly the number of regions of the arrangement of all the hyperplanes H^k_q for q = 1, …, (n_k − 1) and k = 1, …, N. Following Zaslavsky's theorem, the number of regions can be computed as #R(A) = (−1)^d χ_A(−1), where χ_A is the characteristic polynomial of the arrangement. There is no need here to define the characteristic polynomial in detail since Whitney's formula provides a direct way to evaluate it. The subsets B ⊂ A that have a nonempty intersection can be written as B = {H^{ℓ_i}_{q_i} : i = 1, …, p} with 1 ≤ ℓ_1 < ⋯ < ℓ_p ≤ N and q_i ∈ {1, …, n_{ℓ_i} − 1}, where 0 ≤ p ≤ d. This holds because, for q ≠ q′, H^k_q ∩ H^k_{q′} = ∅. Note that, by convention, the set B = ∅ is also included in the sum. Because of the particular choice of the hyperplanes, for a given p, each such B is the nonempty intersection of p hyperplanes in general position.
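The upper bound of Theorem 1 is easy to evaluate numerically; a sketch (the function name `beta` mirrors the notation β_d used later and is our choice), consistent with the 1D/2D counts of Figure 5:

```python
from itertools import combinations
from math import prod

def beta(d, sizes):
    """Upper bound of Theorem 1 on the number of regions of the arrangement
    of N convex partitions of R^d with sizes[k] regions each:
    1 + sum over k = 1..min(d, N) of sum over k-subsets of prod (n_l - 1)."""
    N = len(sizes)
    total = 1
    for k in range(1, min(d, N) + 1):
        total += sum(prod(sizes[i] - 1 for i in subset)
                     for subset in combinations(range(N), k))
    return total

# Two partitions with 3 regions each (cf. Figure 5): 9 regions in 2D, 5 in 1D
print(beta(2, [3, 3]), beta(1, [3, 3]))  # 9 5
# When N <= d, the bound reduces to the product n_1 * ... * n_N
print(beta(3, [2, 2, 2]))  # 8
```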

A.3 Sum and Vectorization
Proof of Proposition 2. Let Π_k be a linear convex partition of f_k for k = 1, …, N. On each region of the arrangement A(Π_1, …, Π_N), the f_k are affine, and so are their sum and their vectorization. This implies that A(Π_1, …, Π_N) is a linear convex partition of both the sum and the vectorization of the scalar-valued CPWL functions, which shows that β_d(κ_1, …, κ_N) is a valid upper bound on the number of convex linear regions.

We now prove that the bounds are sharp. First, consider N convex partitions Π_k, where each Π_k is made of the regions of the arrangement of (κ_k − 1) distinct parallel hyperplanes, chosen such that the hyperplanes are in general position when only one per partition is selected. In such a way, the arrangement A(Π_1, …, Π_N) has exactly β_d(κ_1, …, κ_N) convex regions (see the proof of Theorem 1). Second, for each partition, we consider a CPWL function φ_k : R → R with knots (b^p_k)_{p=1}^{κ_k−1} and κ_k distinct affine pieces (φ^p_k)_{p=1}^{κ_k}. In the sequel, the affine pieces are written φ^p_k : x ↦ a^p_k x + c^p_k. The function f_k : x ↦ φ_k(w_k^T x) has exactly κ_k linear convex regions, and Π_k is a linear convex partition of it. The construction implies that A(Π_1, …, Π_N) is a linear convex partition of both (f_1 + ⋯ + f_N) and (f_1, …, f_N). Because the affine pieces of each φ_k are distinct, the vector-valued function (f_1, …, f_N) agrees with distinct affine pieces on each region of A(Π_1, …, Π_N), which proves that this partition has the minimal number of linear convex regions. This yields CPWL functions such that κ_{(f_1,…,f_N)} = β_d(κ_1, …, κ_N). In the case of the sum (f_1 + ⋯ + f_N), by contrast, there is no guarantee that A(Π_1, …, Π_N) is a partition with the minimal number of linear convex regions. To ensure that the regions of this partition carry different affine pieces, it is sufficient to choose the slopes (a^p_k) appropriately. An explicit choice is a^p_k = p m^{k−1} with m = max_k(κ_k). The offsets c^p_k are then set such that φ_k is continuous. In such a way, the slope of the sum on the region of A(Π_1, …, Π_N) indexed by (p_1, …, p_N) is Σ_{k=1}^N p_k m^{k−1}. This number can be represented in base m as "(p_N ⋯ p_1)_m", which shows that it is uniquely related to the choice of indices (p_k). Although this choice seems very specific, a random choice of the slopes would also satisfy the condition almost surely. We have therefore found a collection of CPWL functions whose sum has exactly β_d(κ_1, …, κ_N) linear convex regions.
Second, we propose a construction inspired by [6] to derive the lower bound given on the maximal number of regions. Let the sawtooth function sw_p of order p be the unique 1D CPWL function with knots located at k/p for k = 1, …, (p − 1) that satisfies sw_p(k/p) = (1 − (−1)^k)/2 for k = 0, …, p. The key properties of the sawtooth function of order p that will prove useful in the sequel are as follows:
• it has p projection regions that are also convex linear regions;
• it can be decomposed by means of the functions φ_{k,p} : x ↦ p(x + 2(−1)^p |x − k/p|), each a CPWL function with 2 projection regions;
• the composition of sawtooth functions is a sawtooth function whose order is the product of the orders of the composed functions, as in sw_p ∘ sw_q = sw_{pq} for p, q ∈ N.
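The sawtooth function and its composition property can be checked directly; a sketch (the implementation via a period-2 triangle wave is our choice):

```python
import numpy as np

def sw(p, x):
    """Sawtooth of order p on [0, 1]: CPWL with sw(k/p) = (1 - (-1)**k) / 2."""
    y = (p * np.asarray(x, dtype=float)) % 2.0  # period-2 triangle-wave phase
    return np.where(y <= 1.0, y, 2.0 - y)

xs = np.linspace(0.0, 1.0, 1001)
# Composition multiplies the orders: sw_p o sw_q = sw_{p*q}
err = float(np.max(np.abs(sw(3, sw(2, xs)) - sw(6, xs))))
print(err)  # ~0, up to floating-point error
```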
We now define the nonlinear pointwise function ϕ_ℓ. Let {J_{ℓ,τ_ℓ,i}}_i be a partition of the set {1, …, p_{ℓ,r}}, where the cardinality of the subsets is in one-to-one correspondence with {κ_{ℓ,q}}_{q∈τ_ℓ^{-1}({r})}. In this way, we assign to each k ∈ {1, …, d_ℓ + 1} a set of indices J_{ℓ,τ_ℓ,i_k} that allows us to define the kth component of ϕ_ℓ. From the pointwise property of ϕ, we deduce that, for any t_1, …, t_{d*} ∈ R, the function u_{ℓ+1} ∘ ϕ ∘ v_ℓ is a pointwise multivariate function with 1D sawtooth components of order p_{ℓ,r} for r = 1, …, d*. We denote it by sw_{p_ℓ} with p_ℓ = (p_{ℓ,1}, …, p_{ℓ,d*}). The function f_ℓ of the NN is chosen accordingly, which shows that f_{ℓ,k} has the same number of projection regions as ϕ_{ℓ,k} (namely, κ_{ℓ,k}) whenever w_{ℓ,k} ≠ 0.
All in all, we now note that there are no fewer projection regions of f than of the pointwise sawtooth function sw_q, where q = (q_1, …, q_{d*}) and q_r = Π_{ℓ=1}^L p_{ℓ,r}. The properties of the sawtooth functions and the special form of u_1 yield the projection regions for h indexed by i_r = 0, …, (q_r − 1) for r = 1, …, d*. In summary, the number of projection regions of the constructed CPWL NN is at least Π_{r=1}^{d*} q_r. The conclusion is reached by noticing that the reasoning does not depend on any property of the mappings τ_ℓ: one can therefore pick the ones that yield the largest lower bound.
C Proofs of the Knot Density of CPWL Layers

Proof of Proposition 5. Without loss of generality, let γ : t ↦ x_0 + t u be a segment with ∥u∥_2 = 1 parameterized over S. We first compute the probability P(kt^γ_f = 1) that f has a knot along γ. The hyperplane {x ∈ R^d : w^T x + b = 0} intersects the line {x_0 + t u : t ∈ R} at t_0 such that w^T(x_0 + t_0 u) + b = 0 or, equivalently, b = −w^T(x_0 + t_0 u). In order to have a knot along γ|_S, t_0 has to lie in S. For a given w, this implies that b should be in an interval of length |S||w^T u|, more precisely [−w^T x_0 − |S| w^T u, −w^T x_0] if w^T u > 0 and [−w^T x_0, −w^T x_0 − |S| w^T u] otherwise. Therefore, P(kt^γ_f = 1 | w) ≤ sup_{t∈R} ρ_b(t) |S| |w^T u|. From the independence of the random variables and from the fact that kt^γ_f = 0 or kt^γ_f = 1 almost surely, we infer that E[λ^γ_f] ≤ sup_{t∈R} ρ_b(t) E[|w^T u|]. In addition, suppose that the components w_k are independent and normally distributed with standard deviation σ_w. The random variable w^T u is then also normally distributed with standard deviation σ_w (since ∥u∥_2 = 1). We can now compute explicitly E[|w^T u|] = σ_w √2/√π based on the properties of half-normal distributions. The result is extended to any polygonal chain through the linearity of the expectation operator and by applying the result to the finitely many pieces of the polygonal chain.
Proof of Proposition 6. A knot of f along a line γ must lie on a hyperplane H_{p,q} = {x : (w_p − w_q)^T x + (b_p − b_q) = 0} with 1 ≤ p < q ≤ K, since elsewhere the Maxout unit is affine. Therefore, the expected knot density is bounded by summing, over the K(K − 1)/2 hyperplanes, the probability that a randomly generated hyperplane intersects a segment of length |S|, which we bound as in the proof of Proposition 5 by using the independence of the random variables. We also write (A ≠ ∅) to denote the variable that takes the value 0 if A = ∅ and 1 otherwise. When the random variables are normally distributed, the reasoning is similar to that in the proof of Proposition 5.
Proof of Proposition 7. A knot of f along a line γ must lie on a hyperplane H_{p,q} = {x : (w_p − w_q)^T x + (b_p − b_q) = 0}, where 1 ≤ p < q ≤ K and where p, q belong to the same sorting group, since elsewhere the GroupSort layer is affine. One can now follow the same steps as those in the proof of Proposition 6 with n_g (g_s choose 2) = n_g g_s(g_s − 1)/2 = d(g_s − 1)/2 hyperplanes.

D Proofs of the Bounds on the Expected Knot Density of CPWL NNs
Proof of Lemma 4. In what follows, the technical developments originate from the fact that f_θ is not differentiable everywhere. The function f_θ ∘ γ is the composition of two CPWL functions; hence it is CPWL and, therefore, differentiable for almost every t ∈ S. Note, however, that we cannot assert that the Jacobian of f_θ is well defined at γ(t) for almost every t ∈ S. Indeed, whenever γ follows the boundary of two projection regions of f_θ, the Jacobian of f_θ along γ becomes ill-defined. This is why the notion of directional derivative is better suited. The characteristic function φ^γ_{f_θ} of f_θ along γ is piecewise-constant on S: we can partition S into finitely many convex regions on which φ^γ_{f_θ} is constant. Let P ⊂ S denote one of these regions and let Q ⊂ S be a linear convex region of γ. Following the definition of the characteristic function, there exists a projection region Ω of f_θ such that γ(int(P ∩ Q)) either lies entirely in the interior of Ω or entirely on its boundary. In the first case, f_θ is differentiable on γ(int(P ∩ Q)) and we have that (f_θ ∘ γ)′(t) = J_{f_θ}(γ(t)) γ′(t) = D_{γ′(t)}{f_θ}(γ(t)) for t ∈ int(P ∩ Q). In the second case, γ is differentiable on int(P ∩ Q) as well, but the Jacobian of f_θ is undefined. Fortunately, the directional derivative of f_θ is well defined along γ(t) since, for any t ∈ int(P ∩ Q), there exists ϵ > 0 such that τ ↦ f_θ(γ(t) + τ γ′(t)) is affine on (−ϵ, ϵ). All in all, the relation (f_θ ∘ γ)′(t) = D_{γ′(t)}{f_θ}(γ(t)) holds for any t ∈ int(P ∩ Q) and, more generally, for almost any t ∈ S because of the properties of P and Q. We can now write that E[Len(f_θ ∘ γ)] = E[∫_S ∥(f_θ ∘ γ)′(t)∥_2 dt] = ∫_S E[∥D_{γ′(t)}{f_θ}(γ(t))∥_2] dt, where we have used Tonelli's theorem to interchange the expectation and the integral.
With Lemma 4 and the law of iterated expectation, we can bound E[kt^γ_{F_ℓ}], where the key inequality follows from the first assumption of the theorem and the application of Proposition 3, and requires the independence of the random variables. We then apply Lemma 4 recursively to F_ℓ and invoke the second assumption of the theorem. In terms of knot densities, this yields a recurrence relation that directly gives the announced bound.
Proof of Corollary 3. The proof is similar to that of Theorem 3 except that, with the different second assumption, the quantity E_{θ_1,…,θ_{ℓ−1}}[Len(F_{ℓ−1} ∘ γ)] can be bounded by D_0 Len(γ) (Lemma 4). In the end, the recurrence relation (71) is modified accordingly, which yields the announced bound.

Figure 1: The three sources of complexity of CPWL NNs.

Figure 2: An R 2 → R CPWL function and its corresponding partition of the input space.

Figure 5: Arrangement of two convex partitions with 3 regions each.While in 2D the maximal number of regions is 3 × 3 = 9, this cannot be reached in 1D, for which the maximum is 5.

Figure 6: Partition and complexity of some CPWL components.

Figure 8: Example of a 1D CPWL path γ : R → R^2. The value of the characteristic function φ^γ_f along γ is given as a 4D vector and allows one to identify the 3 knots along γ.

Proposition 5 (Knot density – ReLU). Let (w, b) ∈ R^d × R be independent random variables with bounded probability density functions ρ_b for b and ρ_w for the components of w, which are i.i.d. Then, the expected knot density of the ReLU CPWL component x ↦ ReLU(w^T x + b) along any 1D CPWL path γ is bounded in terms of σ_w sup_{t∈R} ρ_b(t).

There are Σ_{1≤ℓ_1<⋯<ℓ_p≤N} Π_{i=1}^p (n_{ℓ_i} − 1) such subsets of A. The intersection of the elements of B has dimension (d − p) (recall that the hyperplanes of B are in general position). All in all, we have that
#R(A) = (−1)^d Σ_{B⊂A : ∩_{H∈B} H ≠ ∅} (−1)^{#B} (−1)^{dim(∩_{H∈B} H)}
= 1 + (−1)^d Σ_{k=1}^d Σ_{1≤ℓ_1<⋯<ℓ_k≤N} (−1)^k (−1)^{d−k} Π_{i=1}^k (n_{ℓ_i} − 1)
= 1 + Σ_{k=1}^d Σ_{1≤ℓ_1<⋯<ℓ_k≤N} Π_{i=1}^k (n_{ℓ_i} − 1), (49)
which is the upper bound given in the theorem. When N ≤ d, we readily check that the bound gives n_1 ⋯ n_N. To prove the second additional bound for N > d given in the theorem, we invoke the binomial theorem.

sup_{t∈R} ρ_b(t) E[|w^T u|] ≤ sup_{t∈R} ρ_b(t) √(E[|w^T u|^2]) = sup_{t∈R} ρ_b(t) √(u^T E[w w^T] u) ≤ sup_{t∈R} ρ_b(t) √(E[w^2]). In the last step, the assumption that the random variables w_k are i.i.d. has allowed us to infer that E[w w^T] = E[w^2] I, where I ∈ R^{d×d} is the identity matrix. If b is normally distributed with standard deviation σ_b, then sup_{t∈R} ρ_b(t) = (σ_b √(2π))^{-1}.