VC dimensions of group convolutional neural networks

We study the generalization capacity of group convolutional neural networks. We identify precise estimates for the VC dimensions of simple sets of group convolutional neural networks. In particular, we find that for infinite groups and appropriately chosen convolutional kernels, already two-parameter families of convolutional neural networks have an infinite VC dimension, despite being invariant to the action of an infinite group.


Introduction
Due to impressive results in image recognition, convolutional neural networks (CNNs) have become one of the most widely used neural network architectures [12,13]. It is believed that one of the main reasons for the efficiency of CNNs is their ability to convert translation symmetry of the data into a built-in translation-equivariance property of the neural network without exhausting the data to learn the equivariance [4,15]. Based on this intuition, other data symmetries have recently been incorporated into neural network architectures. Group convolutional neural networks (G-CNNs) are a natural generalization of CNNs that can be equivariant with respect to rotation [5,24,23,9], scale [21,20,1], and other symmetries defined by matrix groups [7]. Moreover, every neural network that is equivariant to the action of a group on its input is a G-CNN, where the convolutions are with respect to the group [11] (see Theorem 2.10 below).
Although one of the main reasons for constructing equivariant neural networks is their ability to generalize better than neural networks without built-in symmetries, the theoretical understanding of this phenomenon still needs to be better developed. Here, we say that a neural network generalizes if its performance on unseen data is comparable to its performance on the training data.
The most studied direction in the analysis of equivariant and invariant models is the analysis of sample complexity. Several results show that the sample complexity of learning problems is improved if the objective function and the learned algorithm maintain symmetry. For example, sample complexity is improved by a factor equal to the group size when using an invariant kernel over the group, compared to the corresponding non-invariant kernel [3]. In [18], it was observed that a larger volume of a group leads to a smaller generalization error. In addition, [6] analyzes sample complexity by calculating a covering number of the input set and concludes that the sample complexity is much smaller for a neural network with invariances because, in general, the covering number of the orbit space representatives is much smaller than the covering number of the original input set.
Overall, the previous work suggests that assuming more invariances of a model implies better generalization behavior. This work can be considered a counterpoint to that intuition in the context of G-CNNs.
Our main contributions are the lower and upper bounds of the VC dimension for a simple two-layer neural network consisting of one convolutional layer and one pooling layer.
If one fixes the convolutional kernel, then the set of associated G-CNNs has only two free parameters, which stem from the involved bias parameters. Nonetheless, we find that in the case of an infinite group, for each n ∈ N, there is a fixed convolution kernel such that the VC dimension of the proposed elementary neural network is at least n. More precisely, when the symmetry group has size n ∈ N, there is a convolutional kernel such that the VC dimension of the proposed neural network is at least log_2(n) − 2 log_2(log_2(n)) − 4. The details are given in Theorems 3.5 and 3.6 as well as in Corollary 3.7. The VC dimension of the neural networks with the identified kernel almost matches an associated upper bound, which is log_2(n) + 9 log_2(log_2(n)) (see Corollary 3.4).
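To get a feeling for how close the two quoted bounds are, the following minimal sketch evaluates them for a few group sizes. The function names are hypothetical and chosen only for illustration; the formulas are the two expressions stated above, which are meaningful for n > 2.

```python
import math


def vc_lower_bound(n: int) -> float:
    """Lower bound log2(n) - 2*log2(log2(n)) - 4 quoted in the text (n > 2)."""
    log_n = math.log2(n)
    return log_n - 2 * math.log2(log_n) - 4


def vc_upper_bound(n: int) -> float:
    """Upper bound log2(n) + 9*log2(log2(n)) quoted in the text (n > 2)."""
    log_n = math.log2(n)
    return log_n + 9 * math.log2(log_n)


if __name__ == "__main__":
    # Both bounds grow like log2(n); they differ by 11*log2(log2(n)) + 4.
    for exponent in (8, 16, 64, 256):
        n = 2 ** exponent
        print(exponent, vc_lower_bound(n), vc_upper_bound(n))
```

For small groups the lower bound is not informative (it is negative for n = 2^8), which is consistent with the proof of Corollary 3.7 treating small n as a trivial case.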
Our result allows a significant, potentially counter-intuitive conclusion.While a larger group implies more invariances of the associated group-invariant classifiers, it nonetheless yields a higher VC dimension for an appropriately chosen kernel.
This paper is organized as follows: In Section 2, we introduce the model for group convolutional neural networks that will be analyzed in the rest of the work. Thereafter, in Section 3, we present our main results. First, Theorem 3.1 yields an upper bound for the VC dimension of a G-CNN with a finite group. Second, in Theorem 3.5, we demonstrate an associated lower bound for a carefully chosen kernel K.

Group convolutional neural networks

Preliminaries
This paper studies the relationship between the symmetry group and the generalization capabilities of associated group convolutional neural networks. First, we formally introduce all necessary concepts. We start with the definition of the type of group used in this work. This group later encodes the symmetries of the associated group convolutional neural networks. To stress this point, we will henceforth often call it the symmetry group.
Definition 2.1 (Symmetry group). A group (G, •) is a set with an operation that satisfies the following three properties. First, associativity holds: (g_1 • g_2) • g_3 = g_1 • (g_2 • g_3) for all g_1, g_2, g_3 ∈ G. Second, there exists an identity element e ∈ G such that g • e = e • g = g for all g ∈ G. Third, for all g ∈ G there exists a unique inverse g^{-1} ∈ G such that g • g^{-1} = g^{-1} • g = e.
Finally, a group is a topological group if it is equipped with a topology such that multiplication with an element and inversion of elements are continuous operations. We will assume all groups in the sequel to be topological groups. Moreover, we assume the topology to be Hausdorff and first-countable.
In the sequel, groups act on certain sets, and we expect the group convolutional neural networks to interact appropriately with the corresponding action. To formalize this, we proceed by defining an action.
Definition 2.2 (Action of a symmetry group). Let G be a symmetry group. An action of G on a set X is a map a : G × X → X such that, for all g_1, g_2 ∈ G and all x ∈ X, a(g_1 • g_2, x) = a(g_1, a(g_2, x)).
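The compatibility condition of Definition 2.2 can be checked by brute force on a small example. The sketch below is only an illustration under an assumed setting: the cyclic group Z_6 acting on X = {0, ..., 5} by cyclic shifts. The names `op`, `act`, and `is_action` are hypothetical.

```python
from itertools import product

N = 6  # hypothetical example: the cyclic group Z_6


def op(g1: int, g2: int) -> int:
    """Group operation of Z_N (addition modulo N)."""
    return (g1 + g2) % N


def act(g: int, x: int) -> int:
    """Assumed action of Z_N on X = {0, ..., N-1} by cyclic shifts."""
    return (x + g) % N


def is_action(group, points) -> bool:
    """Check a(g1 . g2, x) == a(g1, a(g2, x)) for all g1, g2 and x."""
    return all(
        act(op(g1, g2), x) == act(g1, act(g2, x))
        for g1, g2, x in product(group, group, points)
    )


if __name__ == "__main__":
    print(is_action(range(N), range(N)))
```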
We are interested in neural networks that are invariant to specific group actions. This means that the output of a neural network does not change if a group action is applied to the input. The standard architecture of an invariant neural network consists of two parts: first, a so-called equivariant neural network, and second, a pooling operation. To clarify this concept, we first recall equivariant maps.
Definition 2.3 (Equivariant map). Let X, Y be sets, let G be a group, and let a_X, a_Y be group actions on X and Y, respectively. A map f : X → Y is called equivariant with respect to a_X and a_Y if f(a_X(g, x)) = a_Y(g, f(x)) for all g ∈ G and x ∈ X.
Definition 2.3 requires that the group transformations commute with the application of f. In other words, transforming the input before the application of f is equivalent to transforming the output after the application of f.
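The commutation property of an equivariant map can be tested numerically. The following sketch assumes a simple setting that is not taken from the text: signals on the cyclic group Z_5, the shift action, and a circular convolution as the candidate map. A position-dependent rescaling serves as a counterexample. All names are hypothetical.

```python
N = 5  # hypothetical example: signals indexed by the cyclic group Z_5


def shift(g, signal):
    """Action of g in Z_N on a signal: (g . s)[x] = s[(x - g) mod N]."""
    return [signal[(x - g) % N] for x in range(N)]


def circular_conv(signal, kernel):
    """(s * k)[x] = sum over h of s[(x - h) mod N] * k[h]."""
    return [
        sum(signal[(x - h) % N] * kernel[h] for h in range(N))
        for x in range(N)
    ]


def position_scaling(signal):
    """A map that depends on the absolute position, hence not equivariant."""
    return [x * signal[x] for x in range(N)]


def is_equivariant(f, signal):
    """Check f(g . s) == g . f(s) for every g in Z_N."""
    return all(f(shift(g, signal)) == shift(g, f(signal)) for g in range(N))


if __name__ == "__main__":
    kernel = [1, -2, 0, 3, 1]
    signal = [1, 0, 0, 0, 0]
    print(is_equivariant(lambda s: circular_conv(s, kernel), signal))
    print(is_equivariant(position_scaling, signal))
```

Integer-valued signals are used so that the comparison is exact and does not depend on floating-point rounding.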
Equivariant neural networks define parametric families of equivariant maps by composing layers that are individually equivariant with respect to the same group.
Definition 2.4 (Equivariant neural network [14]). Let n ∈ N, and let X_1, ..., X_n be sets. Let G be a group, and let a_j : G × X_j → X_j for j = 1, ..., n be actions of G. An equivariant neural network with respect to the actions (a_j)_{j=1}^n is a function f : X_1 → X_n which can be described as the composition of linear maps f_i : X_i → X_{i+1} that are equivariant with respect to the actions a_i, a_{i+1}, and a coordinate-wise nonlinearity ρ : R → R. In this definition, for i ∈ [n], the function f_i is called the i-th layer of the neural network f.
For comparison, we also recall the classical notion of feed-forward neural networks.
Definition 2.5 (Feed-forward neural network). Let n ∈ N and let X_1, ..., X_n be sets. A feed-forward neural network is a function f : X_1 → X_n which can be described as the composition of affine maps f_i : X_i → X_{i+1} and a coordinate-wise nonlinearity ρ : R → R. In this definition, for i ∈ [n], the function f_i is called the i-th layer of the neural network f.
Equivalently to Definition 2.4, we can define an equivariant neural network as a feed-forward neural network in which each layer is an equivariant function. Moreover, it is convenient to describe equivariant neural networks through generalized convolutions. To this end, we introduce the Haar measure on a compact group.
Theorem 2.6 ([17, Theorem 8.1.2]). Let G be a compact group. Then, there is a left-invariant finite Borel measure, i.e., a finite Borel measure µ such that µ(A) = µ(g • A) for any measurable set A ⊂ G and any g ∈ G. This measure is called a Haar measure. The Haar measure is unique up to scaling.
Remark 2.7. Every finite topological group is compact. Hence, there exists a Haar measure on every finite group. On a finite group, the counting measure is a Haar measure. Since scaling does not affect the results in the rest of the manuscript, we will always choose the counting measure as the Haar measure on a finite group.
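The left-invariance of the counting measure stated in Remark 2.7 can be verified exhaustively on a small non-abelian group. The sketch below assumes the symmetric group S_3, represented by permutation tuples, as a hypothetical example; the helper names are illustrative.

```python
from itertools import combinations, permutations


def compose(p, q):
    """Composition of permutations given as tuples: (p . q)(i) = p[q[i]]."""
    return tuple(p[i] for i in q)


def counting_measure(subset):
    """Counting measure of a finite subset of the group."""
    return len(subset)


def left_translate(g, subset):
    """The set g . A = {g . h : h in A}."""
    return {compose(g, h) for h in subset}


def is_left_invariant(group):
    """Check mu(g . A) == mu(A) for every subset A and every g."""
    subsets = (
        set(c) for k in range(len(group) + 1) for c in combinations(group, k)
    )
    return all(
        counting_measure(left_translate(g, a)) == counting_measure(a)
        for a in subsets
        for g in group
    )


if __name__ == "__main__":
    s3 = list(permutations(range(3)))
    print(is_left_invariant(s3))
```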
On compact groups, we can define the generalized convolution. We will typically use the Haar measure as the measure in the convolution. However, in a couple of special cases later, we require a bit more generality, which is why we state the definition for general finite measures.
Definition 2.8 (Generalized convolution). Let G be a compact group, let µ be a finite measure on G, and let f_1, f_2 : G → R be functions. Then, the convolution of f_1 with f_2 is defined by (f_1 *_G f_2)(g) := ∫_G f_1(g • h^{-1}) · f_2(h) dµ(h) for g ∈ G.
Remark 2.9. For the generalized convolution to be sensible, we need the integral ∫_G f_1(g • h^{-1}) · f_2(h) dµ(h) to be well defined for every g ∈ G. This is guaranteed, for example, if f_1, f_2 are both µ-measurable and bounded, which we will assume in the sequel.
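For a finite group with the counting measure, the integral in the generalized convolution becomes a finite sum over the group. The following minimal sketch implements this sum for the symmetric group S_3, which is an assumed example and not taken from the text. Functions on the group are stored as dictionaries, and all names are hypothetical.

```python
from itertools import permutations


def compose(p, q):
    """Composition of permutations given as tuples: (p . q)(i) = p[q[i]]."""
    return tuple(p[i] for i in q)


def inverse(p):
    """Inverse permutation of p."""
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)


def group_conv(f1, f2, group):
    """(f1 * f2)(g) = sum over h of f1(g . h^{-1}) * f2(h), counting measure."""
    return {
        g: sum(f1[compose(g, inverse(h))] * f2[h] for h in group)
        for g in group
    }


if __name__ == "__main__":
    s3 = list(permutations(range(3)))
    identity = (0, 1, 2)
    delta_e = {g: 1 if g == identity else 0 for g in s3}
    f = {g: i for i, g in enumerate(s3)}
    # The indicator of the identity acts as a unit for the convolution.
    print(group_conv(f, delta_e, s3) == f, group_conv(delta_e, f, s3) == f)
```

Convolving with the indicator function of the identity element returns the original function, which is a quick sanity check for the order of the arguments g • h^{-1} and h.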
Theorem 2.10 ([11]). A feed-forward neural network f is equivariant to the action of a compact group G on its inputs if and only if each layer of f implements a generalized form of convolution with respect to G.
Thanks to Theorem 2.10, each equivariant neural network is based on repeated generalized convolutions and hence can also be called a group convolutional neural network (G-CNN). The standard way of designing an invariant classifier or regressor corresponding to an action a : G × X → X using neural networks is to compose an equivariant neural network with a global pooling operation [11,10,2]. Concretely, if f is an equivariant neural network, then the invariant map is obtained by applying the global pooling operation to the output of f. To illustrate the concepts described above, we give an example.
Example 2.11. Let X := Z^2, which is a homogeneous space under the standard action of the group T := Z^2 of integer translations: a(t, x) := x + t for t ∈ T and x ∈ X. An image, such as an image of a handwritten digit, can be thought of as a function on X that, given coordinates, returns the pixel value at the corresponding location: I : X → R. The standard action of the group of integer translations can be extended to an action on images by (t · I)(x) := I(x − t). The generalized convolution, in this case, is a standard convolution layer, and a classifier that is invariant to the action of T can be designed by first applying n convolution layers and then composing them with an average pooling operator.
Similar examples can be constructed with respect to group actions of compact groups incorporating rotations, shearings, or general affine transformations.

Introduction of the G-CNN model
In this paper, we consider the situation already encountered in Example 2.11, where the input space for a neural network is a subset of functions on a homogeneous space X. As mentioned in [11], in this case, after fixing an origin y ∈ X, we can map a function f : X → R to a function f̃ : G → R by
f̃(g) := f(a(g, y)) for g ∈ G. (1)
Now f̃ can be convolved with a kernel K : G → R if both f̃ and K are measurable and bounded. Clearly, f̃ is bounded if f is. Hence, for convenience, we define
B(X, µ) := {f : X → R : f is bounded and f̃ is µ-measurable}.
Now the generalized convolution between a function f ∈ B(X, µ) and a kernel K : G → R can be defined as
(f *_G K)(g) := ∫_G f̃(g • h^{-1}) K(h) dµ(h) for g ∈ G.
In this work, we study the generalization capabilities of invariant classifiers with the architecture described in the previous subsection. These classifiers are a composition of a G-CNN and a pooling operation, followed by the application of a sign function before the output. Typically, the hypothesis class of this type of classifier consists of a neural network with several layers, where each layer is a generalized convolution with a kernel taken from a linear space with a predetermined basis. The learning procedure selects suitable kernels by finding their coordinates in a given basis, i.e., the coordinates are the learnable parameters of the training algorithm. In our analysis, we consider neural networks that have the fewest training parameters, i.e., neural networks with only one convolutional layer and with a fixed kernel. We state the corresponding definition below.
Definition 2.12 (G-CNNs with fixed kernel). Let G be a compact group acting on X. Let K : G → R be a bounded kernel. Let µ be the Haar measure on G. For c_1, c_2 ∈ R, let H_{c_1,c_2}(K) denote the two-layer classifier on B(X, µ) that consists of the generalized convolution with K, a pooling operation, and a final application of the sign function, with bias parameters c_1 and c_2. Here sign = 2 · 1_{(0,∞)} − 1. We denote the set of G-CNNs with kernel K by H(K) := {H_{c_1,c_2}(K) : c_1, c_2 ∈ R}.
We note that the set H(K) has only two scalar parameters since the kernel K is fixed for the whole class. Even though this neural network contains only two scalar parameters and, in addition, is constrained to be invariant, we will demonstrate in the next section that when the group G contains an infinite number of elements, there exists, for each n ∈ N, a convolution kernel K such that the VC dimension of H(K) is at least n. By the fundamental theorem of learning ([16, Theorem 3.20] or [19, Theorem 6.7]), this implies that successful learning of these neural networks from finitely many samples is impossible in general. On the other hand, we will see that the VC dimension can be upper-bounded for finite groups.
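A two-parameter invariant classifier of this kind can be sketched as follows. The concrete formula used here, sign(Σ_g ReLU((f *_G K)(g) + c_1) + c_2), is an assumption made for illustration: the choice of ReLU as nonlinearity, of sum pooling, and of where the two biases enter is not confirmed by the text above. The group Z_6 with origin y = 0 and all function names are likewise hypothetical.

```python
N = 6  # hypothetical setting: G = X = Z_6, origin y = 0, counting measure


def conv(f, kernel):
    """(f * K)(g) = sum over h of f((g - h) mod N) * K(h)."""
    return [
        sum(f[(g - h) % N] * kernel[h] for h in range(N)) for g in range(N)
    ]


def sign(t):
    """sign = 2 * 1_{(0, inf)} - 1, so that sign(0) = -1."""
    return 1 if t > 0 else -1


def classifier(f, kernel, c1, c2):
    """ASSUMED form: sign(sum_g relu((f * K)(g) + c1) + c2)."""
    pooled = sum(max(v + c1, 0) for v in conv(f, kernel))
    return sign(pooled + c2)


def shift(t, f):
    """Action of t in Z_N on a function: (t . f)(x) = f((x - t) mod N)."""
    return [f[(x - t) % N] for x in range(N)]


if __name__ == "__main__":
    f = [3, -1, 0, 2, 0, 0]
    kernel = [1, 0, -2, 0, 0, 1]
    # The output does not change when the input is shifted.
    print([classifier(shift(t, f), kernel, c1=-1, c2=-4) for t in range(N)])
```

Under these assumptions, invariance follows because the convolution commutes with shifts and the sum pooling does not depend on the order of the summands.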

VC dimension of a G-CNN associated with a finite group
To study the generalization capacity of sets of G-CNNs of the form H(K), we compute the so-called VC dimension of these sets, which we denote by VCdim(H(K)). We refer to [19, Definition 6.5] for a formal definition of the VC dimension as well as for the concept of shattering. We recall that, by the fundamental theorem of learning [19, Theorem 6.7], a finite VC dimension facilitates learning, whereas an infinite VC dimension prohibits it.
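For a finite family of classifiers on a finite domain, shattering and the VC dimension can be computed by brute force. The sketch below uses threshold classifiers on the real line as a hypothetical stand-in for H(K); it only illustrates the two notions and is not the hypothesis class studied in this paper.

```python
from itertools import combinations


def shatters(hypotheses, points):
    """True if the hypotheses realize all 2^k labelings of k distinct points."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)


def vc_dimension(hypotheses, domain):
    """Largest size of a subset of `domain` shattered by `hypotheses`."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(shatters(hypotheses, s) for s in combinations(domain, size)):
            best = size
        else:
            # Subsets of shattered sets are shattered, so we can stop here.
            break
    return best


def threshold(t):
    """Classifier x -> 1 if x > t else -1 (labels in {-1, 1})."""
    return lambda x: 1 if x > t else -1


if __name__ == "__main__":
    thresholds = [threshold(t) for t in (-0.5, 0.5, 1.5, 2.5)]
    print(vc_dimension(thresholds, [0, 1, 2]))
```

Thresholds can label a single point in both ways but can never label a smaller point 1 and a larger point −1, so the brute-force search stops at size one.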

Upper bound on VC dimension
For finite groups, we have the following upper bound on the VC dimension of H(K).
Theorem 3.1 (Upper bound on VC dimension). Let G be a group acting on X, let |G| = n, and let µ be the Haar measure on G.
Proof.For f ∈ B(X , µ), we consider the function F K,f defined by In particular, |F K,f (R)| ≤ n + 1, where we denote by |A| the cardinality of a finite set A.
Assume that, for m ∈ N, VCdim(H(K)) ≥ m, then there are m functions f 1 , . . ., f m which are shattered by H(K).We observe with the help of ( 6) that F K,f k has at most n break points-the points where the functions are not affine-for each f k , k ∈ [m].This yields that the f k collectively have not more than mn break points.Hence, F K,f1 , . . ., F K,fm have at most mn + 1 constant pieces.Consequently, the constant regions of F K,f1 , . . ., F K,fm divide the real line into no more than mn + 1 segments, i.e., there exist we define maps Note that γ i,k is an affine function as F K,f k is constant on Λ i .We proceed by estimating the number of possible classifications of correspond to the vectors which are produced by the map: Then, the number of elements of Proof.Note that, per Definition 2.12 and ( 7) To improve readability in the remainder of the proof, we omit the explicit reference to the kernel K and write H c1,c2 instead of H c1,c2 (K).
Let l ∈ N, and (p j ) l j=1 ⊂ Λ i be the set of intersection points of (γ i,k ) m k=1 .More precisely, We assume that We also define We denote by P(A) the power set of a set A. Let for k ∈ [m], γ i,k : R → R be an affine linear function that coincides with γ i,k on Λ i .We define u : We also set p 0 as the smallest intersection point of ( γ i,k ) m k=1 which is smaller than p 1 or −∞ if such a point does not exist.Similarly, p l+1 is the largest intersection point of ( γ i,k ) m k=1 which is larger than p l or ∞ if such a point does not exist.
Note that u(k, c) is constant on (p j , p j+1 ) for all j = 0, . . ., l.Hence, for j = 1, . . ., l and an arbitrary c * ∈ (p j , p j+1 ) ∩ Λ i it holds that where 1 u(k,c * ) is a vector which equals 1 in coordinate q if q ∈ u(k, c * ) and 0 else.The last inclusion holds because only the order of the (γ i,k (c 1 )) m k=1 influences the classifications that can be produced with all c 2 and that order is, as explained before, the same for all c 1 ∈ (p j , p j+1 ).
We set for c * ∈ (p 0 , p 1 ) Note that, by construction for j = 1, . . ., l Due to ( 9) and ( 10), we have that Moreover, we set Then, it is clear that u(k, c) is constant on (p j−1 , p j+1 ).Therefore, J j \J j−1 contains no more than l j vectors, where l j is the number of γ ′ r s that intersect another γ r ′ with r = r ′ in p j .We conclude that Where the last inequality follows since the maximum number of intersections between m affine lines is equal to m • (m − 1)/2 and by using the trivial estimate of m for J 1 .
We can state two immediate consequences of Theorem 3.
Proof.Thanks to Corollary 3.3, we have that, for VCdim(

Lower bound on VC dimension
In this subsection, we provide the complement to Theorem 3.1 in the form of a lower bound on the VC dimension of H(K) for an appropriately chosen kernel K. The result requires the underlying group G to act on X with an action a that has a trivial kernel, i.e., for an origin y ∈ X it holds that a(g, y) = a(g′, y) for g, g′ ∈ G only if g = g′. We state two lower bounds in Theorems 3.5 and 3.6. The first result uses a specific assumption on the group, which allows a larger lower bound on the VC dimension of H(K) in terms of the group size. Concretely, we assume the group to contain an element g ≠ e of order two, i.e., g • g = e, where e is the identity element of the group. For example, for a rotation group, the element that corresponds to a rotation by π is an element of order two.
For finite groups, we show in Lemma 3.10 that a necessary and sufficient condition for H(K) to shatter F is that the associated set of orders contains a so-called complete set of orders (Definition 3.9). We demonstrate in Lemma 3.14 that a complete set of orders of size \binom{m}{⌊m/2⌋} exists; let us call it O_1. In Lemma 3.19, we provide, for every set O of r ∈ N injective maps from [m] to [m], a set of m functions and a kernel K such that the associated set of orders of H(K) contains O. This requires |G| ≥ 2rm. Applying Lemma 3.19 with r = \binom{m}{⌊m/2⌋} to O_1 yields that H(K) contains a complete set of orders and finishes the proof if the group is finite.
For infinite groups, the result is shown in Theorem 3.22.
In the general case, when the assumption that the group contains an element of order two is dropped, we have the following result.
Proof. The proof of the theorem repeats the proof of Theorem 3.5, but in the case that the group is finite, we use Lemma 3.20 instead of Lemma 3.19 to prove that H(K) contains a complete set of orders. The case of infinite groups is again addressed by Theorem 3.22.
Next, we state some immediate consequences of Theorem 3.5, which yield a more explicit lower bound on the VC dimension of H(K).
Corollary 3.7. Let m ∈ N, let G be a group acting on X via an action with trivial kernel, let |G| = n > 1, and let µ be the Haar measure on G.
Then, there is a bounded kernel Proof.We start with the second part of the assertion, i.e., the case where G has an element of order two.Note that, we can assume n ≥ 2 8 since otherwise where the result is trivial.Wallis' formula [22] yields that for all m ∈ N Hence, if then, by Theorem 3.5, there is a bounded kernel K : G → R such that VCdim(H(K)) ≥ 2m.Taking logarithms, we conclude that the second inequality of ( 13) is equivalent to We pick m ∈ N, such that 2m which implies ( 14) and hence, by (13) we conclude that The general case follows by a similar argument.We note that we can assume n ∈ N to be such that since the result is trivial otherwise.Then, instead of (13) we use the estimate which yields with Theorem 3.6 the existence of a kernel K : , which is possible because of (15), we conclude with the same computation as above, that In the following subsections, we provide the auxiliary results used to prove Theorems 3.5 and 3.6.

Complete sets of orders
To continue further, we need to introduce some additional notation.We will introduce auxiliary variables ν K,k , which are very closely related to the γ i,k of (7) but defined on the whole domain R and hence not globally affine.
Definition 3.8.Let G be a compact group acting on X , let µ be a finite measure on G, let K : G → R be a bounded kernel and let {f 1 , . . ., f m } ⊂ B(X , µ).We define For c ∈ R, we denote by σ(K, c) the order of (ν We denote by O(K) := {σ(K, c), c ∈ R} the set of orders of (ν K,k (c)) m k=1 obtained by varying c.Requiring that H(K) shatters {f 1 , . . ., f m } imposes restrictions on the set of orders O(K).Vice versa, there is a set O ⊂ P([m]) such that if O ⊂ O(K), then H(K) shatters {f 1 , . . ., f m }.Specifically, we call such sets of orders O complete sets of orders.
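The notion of a complete set of orders can be explored by brute force for small m. The sketch below assumes the separating property that is used in the proof of Lemma 3.10: a set of orders O is treated as complete if, for every A ⊂ [m], some σ ∈ O satisfies σ(i) < σ(j) for all i ∈ A and j ∈ A^c. An order is stored as a tuple whose i-th entry is the rank of element i, indices start at 0, and all function names are hypothetical.

```python
from itertools import combinations, permutations
from math import comb


def separates(sigma, subset, m):
    """True if sigma[i] < sigma[j] for all i in subset and all j outside it."""
    complement = [j for j in range(m) if j not in subset]
    return all(sigma[i] < sigma[j] for i in subset for j in complement)


def is_complete(orders, m):
    """True if every subset of {0, ..., m-1} is separated by some order."""
    subsets = (
        set(c) for size in range(m + 1) for c in combinations(range(m), size)
    )
    return all(any(separates(s, a, m) for s in orders) for a in subsets)


def min_complete_size(m):
    """Smallest size of a complete set of orders (exhaustive, small m only)."""
    all_orders = list(permutations(range(m)))
    for size in range(1, len(all_orders) + 1):
        if any(is_complete(c, m) for c in combinations(all_orders, size)):
            return size
    return None


if __name__ == "__main__":
    for m in (2, 3):
        print(m, min_complete_size(m), comb(m, m // 2))
```

For m = 2 and m = 3 the exhaustive search returns the binomial coefficient \binom{m}{⌊m/2⌋}, in line with the counts discussed later in this section. The search is exponential and only meant for very small m.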
Next, we relate the shattering properties of G-CNNs for a given set of functions with a certain kernel K to the property that O(K) contains a complete set of orders.
Lemma 3.10. Let G be a group acting on X, let µ be the Haar measure on G, let K : G → R be a bounded kernel, and let {f_1, ..., f_m} ⊂ B(X, µ).
Then, H(K) contains all functions from {f 1 , . . ., f m } to {−1, 1} if and only if O(K) contains a complete set of orders.
Proof.First, we prove that if H(K) shatters F := {f 1 , . . ., f m }, then O(K) contains a complete set of orders.Indeed, if H(K) shatters F , then for every A ⊂ [m] there is a classifier cl ∈ H(K), cl : Let us now prove that if O(K) contains a complete set of orders, then H(K) shatters F .Let cl : F → {−1, 1} be arbitrary and choose A = cl −1 ({−1}) to be the set of elements where the classifier has value −1.
By assumption, we can find σ ∈ O(K) such that σ(i) < σ(j) for all i ∈ A, j ∈ A^c. As a consequence, there exists c^* such that ν_{K,i}(c^*) < ν_{K,j}(c^*) for all i ∈ A and j ∈ A^c. Therefore, we have that ν_1 := max_{i ∈ A} ν_{K,i}(c^*) < ν_2 := min_{j ∈ A^c} ν_{K,j}(c^*). Setting c_2 = −(ν_1 + ν_2)/2, we get that ν_{K,i}(c^*) + c_2 < 0 for all i ∈ A and ν_{K,j}(c^*) + c_2 > 0 for all j ∈ A^c, so that the corresponding element of H(K) coincides with cl on F.
Based on the connection between shattering by H(K) and properties of sets of orders established in Lemma 3.10, we can now state a lower bound on the size of O(K).
Lemma 3.11. Let G, µ, K, and {f_1, ..., f_m} be as in Lemma 3.10. If H(K) shatters {f_1, ..., f_m}, then O(K) contains at least \binom{m}{⌊m/2⌋} elements.
Proof. By Lemma 3.10, if H(K) shatters F := {f_1, ..., f_m}, then for every set A ⊂ [m] containing ⌊m/2⌋ elements there is an order σ_A ∈ O(K) such that σ_A(i) < σ_A(j) for all i ∈ A, j ∈ A^c. Moreover, it is easy to see that for two different sets A and B containing ⌊m/2⌋ elements, necessarily σ_A ≠ σ_B. Consequently, the number of permutations in O(K) is not smaller than the number of different sets containing ⌊m/2⌋ elements, which is equal to \binom{m}{⌊m/2⌋}.
The rest of this section shows how to construct a complete set of orders O containing no more than the necessary number of \binom{m}{⌊m/2⌋} elements established in Lemma 3.11.
Lemma 3.12. Let S(q, l) := {A ⊂ [l] : |A| = q} for q, l ∈ N. Then, there is a bijective map F_{q,2q−1} : S(q, 2q − 1) → S(q − 1, 2q − 1) such that F_{q,2q−1}(A) ⊂ A for every set A ∈ S(q, 2q − 1).
Proof.For q = 2, we define Clearly, F 2,3 satisfies the requirements of the lemma.
Lemma 3.13. Let m ∈ N. Then, for all q ∈ [m] there exists a map F_{q,m} : S(q, m) → S(q − 1, m) such that for every A ∈ S(q, m), it holds that F_{q,m}(A) ⊂ A. Moreover, if \binom{m}{q} ≥ \binom{m}{q−1}, then F_{q,m} is surjective. If \binom{m}{q} ≤ \binom{m}{q−1}, then F_{q,m} is injective.
Proof. We prove this statement by induction over m ∈ N. It is easy to see that the statement holds for all q ≤ m for m = 3. Let us assume that the statement holds for all m < r; we will prove the statement for m = r. We denote S_{q,m} := {A ∈ S(q, m) : m ∈ A} and S^c_{q,m} := {A ∈ S(q, m) : m ∉ A}.
Let P([m]) be the power set of {1, 2, . . ., m}, we define three maps: We prove the statement of Lemma 3.13 by considering three cases.
Case 1: m ≥ 2q.In this case, it holds that Consequently, by the induction hypothesis, there exists F q,m−1 : S(q, m − 1) → S(q − 1, m − 1), such that for all A ⊂ S(q, m − 1) it holds that F q,m−1 (A) ⊂ A, and F q,m−1 is surjective.Also, by the induction hypothesis, there exists F q−1,m−1 : S(q − 1, m − 1) → S(q − 2, m − 1), such that for all A in S(q − 1, m − 1) it holds that F q−1,m−1 (A) ⊂ A, and F q−1,m−1 is surjective.We define F q,m (A) using F q−1,m−1 and F q,m−1 by It is clear that F q,m satisfies all the requirements of the theorem by construction.
Also, by the induction hypothesis, there exists F q−1,m−1 : S(q − 1, m − 1) → S(q − 2, m − 1), such that for all A ⊂ S(q − 1, m − 1) it holds that F q−1,m−1 (A) ⊂ A, and F q−1,m−1 is injective.We define F q,m (A) using F q−1,m−1 and F q,m−1 by It holds that F q,m satisfies all the requirements of the theorem by construction.
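The existence of the maps in Lemmas 3.12 and 3.13 can be checked computationally for small parameters. The sketch below does not follow the inductive construction of the proof. Instead, it swaps in a standard augmenting-path bipartite matching (Kuhn's algorithm) to search for an injective map F : S(q, m) → S(q − 1, m) with F(A) ⊂ A. All function names are hypothetical.

```python
from itertools import combinations


def subsets(q, m):
    """All q-element subsets of {1, ..., m} as frozensets."""
    return [frozenset(c) for c in combinations(range(1, m + 1), q)]


def injective_shrink_map(q, m):
    """Search for an injective F: S(q, m) -> S(q-1, m) with F(A) inside A.

    Returns the map as a dict, or None if no injective map exists.
    """
    match_of = {}  # smaller set B -> the larger set A currently mapped to B

    def try_assign(a, visited):
        # Try to give `a` an image, re-routing earlier assignments if needed.
        for x in a:
            b = a - {x}
            if b in visited:
                continue
            visited.add(b)
            if b not in match_of or try_assign(match_of[b], visited):
                match_of[b] = a
                return True
        return False

    for a in subsets(q, m):
        if not try_assign(a, set()):
            return None
    return {a: b for b, a in match_of.items()}


if __name__ == "__main__":
    # Lemma 3.12 with q = 3: a bijection between S(3, 5) and S(2, 5).
    f = injective_shrink_map(3, 5)
    print(len(f), len(set(f.values())), len(subsets(2, 5)))
```

For q = 3 and m = 5 both families have 10 elements, so an injective map is automatically bijective. For q = 2 and m = 5 the search returns None, since S(2, 5) has more elements than S(1, 5).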
In the following lemma, we now establish the existence of relatively small complete sets of orders.The proof of this lemma is based on Lemma 3.13.More specifically, the maps (F q,m ) m q=1 are used to define an order on P([m]).Proof.Let A ∈ S(q, m).We define F 1,m := F q,m and for k ∈ {2, . . ., q}, We define the order σA on A by σA : where F 0,m (A) := A. We have that σA is well defined since, for k A) for all k ′′ < k.Hence, there exists exactly one k ∈ We note that by construction: σA (a) > σA (b), for every a ∈ F k,m (A), and every b We say that an order σ separates a set In particular, (21) yields that σA separates where q = ⌊m/2⌋.We start by defining for Next, we assume that, in the i-th step, for each subset A ∈ S(m − i + 1, m), there is exactly one order σ ∈ O m−i+1 that separates A from A c := [m] \ A. We proceed with the i + 1-st step.We note, that as i ≤ ⌊m/2⌋ it holds that |S(m − i + 1, m)| < |S(m − i, m)|.We define O m−i as the union of O m−i+1 with the following orders: for each A ∈ S(m − i, m), such that there is no order in O m−i+1 separating A from A c , we construct the following order σ A (a) = σA (a), for every a ∈ A, and σ(a We make the following two observations about O m−i : Next, we will show that for Ā ∈ S(m−i, m), the separating order is unique.Assume that Ā ∈ S(m−i, m) is separated from Āc by σ and σ ′ .By the previous observation, we have that σ = σ U and σ Assume towards a contradiction that σ Q = σ U .Then, we conclude that σ V separates for . By induction assumption, we have that However, by (20) it holds that Since F m−i+1,m is injective, we arrive at a contradiction to (24).This yields that σ Q = σ U , and hence we observe that for each A ∈ S(m − i, m), there is exactly one order from O m−i , that separates it.Hence, we conclude that Now, we will prove that every set A ∈ P([m]) is separated from A c by some order from O m−q .We prove this statement by contradiction.
Let A ∈ S(k, m) with the biggest k such that A is not separated from A c by an order from O m−q .By the previous part of the proof, we can assume that k ≥ q.
In this case, as |S(k + 1, m)| > |S(k, m)|, it holds that F_{k+1,m} is surjective by Lemma 3.13. Therefore, there exists B ∈ S(k + 1, m) such that F_{k+1,m}(B) = A. Moreover, as k + 1 > k, it holds that B is separated from B^c by some σ ∈ O_{m−q}. By construction, this implies that A is separated from A^c by σ, which produces a contradiction.
We conclude that O is a complete set of orders containing \binom{m}{⌊m/2⌋} elements.

Construction of an expressive kernel
Let G be a group with |G| ≥ n. In this subsection, we construct a kernel K : G → R such that VCdim(H(K)) is close to the upper bound of Theorem 3.1. We construct, for every m ∈ N with 2m · \binom{m}{⌊m/2⌋} ≤ n, a bounded kernel K such that VCdim(H(K)) ≥ m. We also identify the functions f_1, ..., f_m that are shattered by H(K).
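The size condition 2m · \binom{m}{⌊m/2⌋} ≤ n determines how many functions the construction of this subsection can shatter for a given group size. A minimal sketch that evaluates this condition, with hypothetical function names:

```python
from math import comb


def required_group_size(m: int) -> int:
    """Group size 2m * binom(m, floor(m/2)) needed by the construction."""
    return 2 * m * comb(m, m // 2)


def largest_shatterable_m(n: int) -> int:
    """Largest m with 2m * binom(m, floor(m/2)) <= n (0 if there is none)."""
    m = 0
    while required_group_size(m + 1) <= n:
        m += 1
    return m


if __name__ == "__main__":
    for n in (10, 100, 10 ** 4, 10 ** 6):
        print(n, largest_shatterable_m(n))
```

Since the central binomial coefficient grows roughly like 2^m, the admissible m grows only logarithmically in n, which matches the logarithmic form of the bounds stated in the introduction.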
Recall that an action has trivial kernel if for an origin y ∈ X as in ( 1) the action a on G satisfies that a(g, y) = a(g ′ , y) only if g = g ′ .Therefore, for a function u : G → R, we can define It is not hard to see that, in this case, f = u.Hence, if we find m ∈ N real-valued functions u 1 , . . ., u m on G which are shattered by then there also exist real-valued functions f 1 , . . ., f m on X that are shattered by H(K).Specifically, for g ∈ G with g = e, we will choose the functions u 1 , . . ., u m from the two-dimensional space of functions U generated by the functions 1 e , 1 g .In this case, every function u ∈ U can be written as a linear combination of 1 e , 1 g : To simplify the notation, we define for u as in (25), In the following lemma, we construct for A > 0 functions u 0 , . . ., u 2m ∈ U and associated vectors This choice of u 0 , . . ., u 2m and k i,A is essential for the construction of a kernel that scatters a subset of u 0 , . . ., u 2m .Lemma 3.15.Let m ∈ N, let G be a group acting on X , and let g ∈ G be such that g = e.Let C > B > 0, p ∈ N, and let for i ∈ [p] Then, for all i = 2, 4, . . 
., 2p and for all A such that C > A > B, there exist k i,A ∈ R 2 such that Proof.Note that the definition of ǫ i implies that Let A ∈ (B, C) and m ∈ [p].We now show how to find k m,A .Note first that for all α, β ∈ R Since αu 0 + βu 1 = 0 holds for no tuple (α, β) = (0, 0), we conclude that αu 2m−2 + βu 2m−1 = 0 for all α, β = (0, 0).As a consequence, we have that α(a 2m−2,1 , a 2m−2,2 ) = β(a 2m−1,1 , a 2m−1,2 ) for all (α, β) = (0, 0) and we conclude that the matrix has full rank.Note that by construction, for k ∈ R 2 it holds that As a consequence, we have that there Due to the first inequality in (26), we conclude because of A < C that We note that, by (28) with α = ǫ m−1 and β = −1, Therefore, Since 1 e and 1 g are linearly independent, (31) implies that Moreover, we have from (26) that ǫ m−1 > 2C/B + 1/2 > 2C/B + 1/ǫ m−1 and hence it immediately follows that Plugging ( 29), (30), and (33) into (32) yields Also, by a similar argument as above The same argument can be made to bound Let us collect what we have proved so far: 2. For i = 2m − 2 or i = 2m − 1, it follows directly from (29) and (30), that u i (k) < B.

For
Hence, the proof is complete if we show that u i (k) < 0 for all i > 2m + 1.We have that Moreover, Consequently, Finally, it is not hard to see that, if u 2m+3 (k) < 0 and u 2m+2 (k) < 0, then u i (k) < 0 for all i > 2m+3.
Next, we show that for m ∈ N and a given set of orders O ⊂ {σ : [m] → [m]}, if the group G can be partitioned in a specific way, then there exists a kernel such that, for the functions of Lemma 3.15, an associated set of orders defined similarly to Definition 3.8 contains O.
Lemma 3.16. Let C > B > 0 and r, m ∈ N. Let O = {o_1, ..., o_r} ⊂ {σ : [m] → [m]}, let G be a group and g ∈ G such that g ≠ e, and let µ be a finite measure on G which satisfies µ(g) = µ(e) = 1. In addition, for l ∈ [r], assume that G contains subsets
4. H_l = {h_{l,1}, ..., h_{l,m}}, where h_{l,i} ≠ g^{-1} • h_{l,j} for all j, i ∈ [r] such that j ≠ i.
We define u 0 , . . ., u 2m as in Lemma 3. Proof.We fix We will construct the kernel K : G → R sequentially.We start by defining the kernel K 0 : G → R to be 0 on all elements of the group G.
Next, we define the kernel K, by iteratively updating K l to yield K l+1 and ultimately setting K = K r .Specifically, on the first step we obtain K 1 by redefining K 0 on H 1 , such that for each i ∈ [m] where k o1(i),C−(m−i+1)ǫ for i ∈ [m] is defined using Lemma 3.15, such that for all p ∈ [m] and p = i and Next, we obtain K l+1 by redefining the values of the the kernel K l on H l+1 .Specifically, for H l := l ′ ≤l H l , let Lemma 3.15 guarantees for every i ∈ [m] the existence of k o l+1 (i),m l −(m−i+1)(M l +ǫ) , such that u 2p (k o l+1 (i),m l −(m−i+1)(M l +ǫ) ) < B, for all p ∈ [m] and p = i and if the conditions of the lemma are satisfied, i.e., if B < m l − (m − i + 1)(M l + ǫ) < C for all l ≤ |O| and i ∈ [m].If the conditions hold, we define We will prove that lemma 3.15 can be applied in Lemma 3.17 below.
Lemma 3.17.For every l ∈ [m] it holds that Proof.For every h ∈ G and every u := a1 e + b1 g , where a, b ∈ R, it holds that As a consequence of the construction, we have that for each p ≤ l In addition, as K l (h) = 0 for every h ∈ H r \ H l it holds by construction that g −1 • h / ∈ H l and hence that Let us check (39) for l = 1.By the construction of which implies the result for l = 1.Next, we check (39) for l = p + 1 if it holds for all l ≤ p.We first note that Consequently, by the construction of K p+1 , for each i ∈ [m] there is exactly one element h ∈ H r such that u 2i * G K p+1 (h) < m p − ǫ, and u 2i * G K p+1 (h) > B. Now, we are ready to estimate M p+1 : If we plug this equality to the definition of M p+1 , we receive: It remains to prove that m p+1 − m • (M p+1 + ǫ) > B. We have by construction that Invoking the definition of ǫ (34), yields Lemma 3.17 shows that in every step m l − (m − i + 1)(M l + ǫ) lies between B and C. Therefore, by Lemma 3.15, we find k o l+1 (i),m l −(m−i+1)(M l +ǫ) such that (37) holds.
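The functions u = a · 1_e + b · 1_g used in this construction have a simple convolution with any kernel. Assuming the convolution of Definition 2.8 with the counting measure, a direct computation suggests (u *_G K)(h) = a · K(h) + b · K(g^{-1} • h). The sketch below checks this identity on the cyclic group Z_7; the group, the element g = 3, and all names are hypothetical choices made for illustration.

```python
N = 7       # hypothetical group Z_7 with the counting measure
G_ELEM = 3  # hypothetical element g != e (the identity is e = 0)


def conv(f1, f2):
    """(f1 * f2)(x) = sum over h of f1((x - h) mod N) * f2(h)."""
    return [
        sum(f1[(x - h) % N] * f2[h] for h in range(N)) for x in range(N)
    ]


def indicator(element):
    """Indicator function of a single group element."""
    return [1 if x == element else 0 for x in range(N)]


def two_point_function(a, b):
    """u = a * 1_e + b * 1_g."""
    one_e, one_g = indicator(0), indicator(G_ELEM)
    return [a * s + b * t for s, t in zip(one_e, one_g)]


def closed_form(a, b, kernel):
    """a * K(h) + b * K(g^{-1} . h), where g^{-1} . h = (h - g) mod N."""
    return [a * kernel[h] + b * kernel[(h - G_ELEM) % N] for h in range(N)]


if __name__ == "__main__":
    kernel = [2, -1, 0, 5, 3, -4, 1]
    u = two_point_function(2, -3)
    print(conv(u, kernel) == closed_form(2, -3, kernel))
```

This shows why the values of u *_G K at a point h only involve K(h) and K(g^{-1} • h), which is the reason the construction keeps track of whether g^{-1} • h lies in the sets on which the kernel has already been defined.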
Remark 3.18. We list some properties of the kernel $K$ and the functions $(\nu_{K,i})_{i \in [m]}$ defined in Lemma 3.16.
1. For $B, C, m, r$ as in Lemma 3.16, let $\varepsilon$ be defined according to (34), and for $l \in [r]$ let $c_l := m_l - 0.5\varepsilon$, where $m_l$ is defined in (35). Then, by construction of $K$ for all $l \in$ This property holds because, in the construction in (37), we chose $K$ such that all $u_{2j} *_G K(g)$ for $j \in [m]$, $g \in H^r$ lie between $m_l - (M_l + \varepsilon) < m_l - \varepsilon$ and $m_{l+1}$ for some $l \in [r]$.
Thus, the order $\sigma(K, -c_l)$ is equal to the order $\sigma(K, -c_l + 0.5\varepsilon)$, which is by construction equal to $o_l$ for each $l \in [r]$.

It holds that |ν
To summarize, the values of $(u_{2j} *$ where the last equality follows by (45). Thus, for $c < -B$, the order $\sigma(K, c)$ associated to $u_2, u_4, \dots, u_{2m}$ as defined in Definition 3.8 is equal to the order $\sigma(\tilde K, c)$. We conclude with Lemma 3.16 that $O \subset O_B(\tilde K) \subset O(K)$. This completes the proof.
Remark 3.21. The kernel $K$ of Lemma 3.20 satisfies the following properties:
• For $B, C, m, r$ as in Lemma 3.20, let $\varepsilon$ be defined according to (34), and let for $i \in$ where $m_i$ is defined in (35).
Since the kernel $K$ agrees with the kernel of Lemma 3.16 on all entries where it takes a positive value by (45), the conclusion of Remark 3.18 holds for $K$ as well. However, it now holds for all elements of the group.
Concretely, for all $l \in [r]$, $j \in [m]$ there does not exist $g \in G$ such that $c_l + 0.5\varepsilon > u_{2j} *_G K(g) > c_l - 0.5\varepsilon$.
As a result, the order $\sigma(K, -c_l)$ is equal to the order $\sigma(K, -c_l + 0.5\varepsilon)$, which is by construction equal to $o_l$ for each $l \in [r]$.
Moreover, for an infinite group $G$, we can choose $(H_l)_{l=1}^r \subset G$ as in the proof of Lemma 3.20 such that $H^r = \bigcup_{l \le r} H_l \subset G$. Moreover, for a measure $\mu$ such that $\mu(h) = 1$ for all $h \in H$, and $\mu(h) = 0$ for every $h \in G \setminus H$ where the statement of Lemma 3.20 holds. Indeed, in this case, the proof can be carried out mutatis mutandis.
In Lemma 3.20, we assumed that the group contains a finite number of elements. The following theorem treats the case of infinite groups.
Theorem 3.22. Let $m \in \mathbb{N}$, let $G$ be an infinite compact group acting on $\mathcal{X}$, and let $\mu$ be the Haar measure on $G$. Then, for every $n \in \mathbb{N}$ there is a bounded kernel $K : G \to \mathbb{R}$ such that $\mathrm{VCdim}(H(K)) \ge n$.
Proof. We fix $m \in \mathbb{N}$, $g \in G$ with $g \neq e$, and $u_0, \dots, u_{2m}$ as in Lemma 3.15, and let $O := \{o_1, \dots, o_r\}$ be a complete set of orders.
Next, we show how to construct a kernel $K$ such that $O \subset O(K)$, where $O(K)$ is the set of all orders associated to $u_2, \dots, u_{2m}$ as in Definition 3.8.
As the first step of the proof, we introduce an auxiliary measure $\mu_1$. This will be the counting measure on a finite subset of $G$. To clarify which measure is used in the convolutions in the sequel, we use in this proof the notation $f *_{\mu_1, G} K$ for the convolution of a function $f : G \to \mathbb{R}$ with the kernel $K : G \to \mathbb{R}$ using the measure $\mu_1$.
As already mentioned in Remark 3.21, since $G$ is infinite, we can choose finite sets $H_l \subset G$, $l = 1, \dots, r$, define , where $h_{i,p} \in H_i$, and choose $\mu_1$ to be the counting measure on $H$. Then, Lemma 3.20 can be applied with $\mu_1$ to yield the filter $\tilde K$ such that $O \subset O(\tilde K)$, where all convolutions are integrals with respect to the measure $\mu_1$.
However, since $\mu \neq \mu_1$, it is not necessarily the case that $H(\tilde K)$ shatters $u_2, \dots, u_{2m}$. To correct this, we choose an open neighborhood $U$ of the identity such that $h \cdot U \cap h' \cdot U = \emptyset$ for all $h, h' \in H$ with $h \neq h'$.
The existence of $U$ follows from the Hausdorff property of $G$. Indeed, we can construct disjoint open sets $(U_h)_{h \in H}$ such that $h \in U_h$ for all $h \in H$. Then, we set Since we only take finitely many intersections in the construction of $U$, it is clear that $U$ is open. Moreover, we directly see that $e \in U$. Assuming there exist $h, h' \in H$ such that $h \cdot U \cap h' \cdot U \neq \emptyset$ implies by construction that Clearly, (46) can only hold if $h = h'$. This shows that $U$ as desired exists.
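One way to realize such a neighborhood can be written out as follows. This is a sketch under an assumption: the explicit formula for $U$ below is chosen to match the description above (finitely many intersections, $e \in U$) and is not necessarily the exact display used in the original argument.

```latex
% Assumption: (U_h)_{h \in H} are pairwise disjoint open sets with h \in U_h, and H is finite.
U := \bigcap_{h \in H} h^{-1} \cdot U_h .
% Each h^{-1} \cdot U_h is open, because left translation is a homeomorphism,
% and contains e, because h \in U_h. A finite intersection of such sets is
% again an open neighborhood of e. Moreover, for every h \in H,
h \cdot U \subset h \cdot \big( h^{-1} \cdot U_h \big) = U_h ,
% and therefore, for h \neq h',
h \cdot U \cap h' \cdot U \subset U_h \cap U_{h'} = \emptyset .
```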
We define a kernel $K$ as a modification of the kernel $\tilde K$: To continue further, we introduce additional notation. We define $a_i$ and $b_i$ as the coefficients $a_i \mathbb{1}_e + b_i \mathbb{1}_g := u_{2i}$ for $i \in [2m]$. Moreover, we define for $h' \in G$ $K_{a_i,b_i}(h') := \sum_{h \in H} (a_i K(h)$ We set $\tilde\nu_{K,i} : \mathbb{R} \to \mathbb{R}$, $\tilde\nu_{K,i}(c) := \int_G \mathrm{ReLU}(K_{a_{2i},b_{2i}}(h) - c)\, d\mu(h)$.
For $(c_j)_{j=1}^r$ defined as in Remark 3.21, we now show that for $i \in [m]$ the newly defined $\tilde\nu_{K,i}(c_j)$ are equal to $\mu(U) \cdot \nu_{K,i}(c_j)$. Indeed, $\tilde\nu_{K,i}(c_j) = \int_G \mathrm{ReLU}(K_{a_{2i},b_{2i}}(h_1) - c_j)\, d\mu(h_1)$ where $\nu_{K,i}$ is as in Definition 3.8. The second equality holds since $a_{2i} K(h) + b_{2i} K(g^{-1} \cdot h)$ is greater than $c_j$ only for $h \in H^r = \bigcup_{l \in [r]} H_l$; the fourth equality holds since for a Haar measure $\mu(U) = \mu(h \cdot U)$.
From the proven equality, we can conclude that the order of $(\nu_{K,i}(c_j))_{i \in [m]}$ is $o_j$ for every $j \in$ We finish the proof by modifying $u_i$, $i \in [2m]$, to yield $\tilde u_i$, $i \in [2m]$, and showing that $\tilde u_{2i}$, $i \in [m]$, are shattered by $H(K)$.
We claim that for every $\delta > 0$ there exist open sets $\tilde U_\delta$ such that for all $i \in [2m]$ Before we prove (48), we show how it yields the claim. We fix $\delta := m/4$, and denote the functions $\mu(\tilde U_\delta)^{-1} (a_i \mathbb{1}_{\tilde U_\delta^{-1}} + b_i \mathbb{1}_{(g \cdot \tilde U_\delta)^{-1}}) *_G K$ as $\tilde K_{a_i,b_i}$, and We prove that $H(K)$ shatters $(\tilde u_{2i})_{i \in [m]}$ by showing that for each $j \in [r]$ the order of $(\int_G \mathrm{ReLU}(\tilde K_{a_{2i},b_{2i}}(g) - \tilde c_j)\, d\mu(g))_{i \in}$ where we used the linearity and monotonicity of the integral and the 1-Lipschitz property of the ReLU in the first inequality. Thus, for each $j \in [r]$ the order of $(\int_G \mathrm{ReLU}(\tilde K_{a_{2i},b_{2i}}(g) - \tilde c_j)\, d\mu(g))_{i \in [m]}$ is equal to the order of $(\nu_{K,i}(c_j))_{i \in [m]}$, which is equal to $o_j$.
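The type of estimate invoked here (linearity and monotonicity of the integral together with the 1-Lipschitz property of the ReLU) can be illustrated by the following generic inequality. It is a sketch for arbitrary bounded measurable functions $f, \tilde f$ and thresholds $c, \tilde c$, which are placeholder symbols, and it is not a reconstruction of the specific display (48).

```latex
% f, \tilde f : G \to \mathbb{R} bounded and measurable, c, \tilde c \in \mathbb{R}, \mu finite.
\left| \int_G \mathrm{ReLU}(f - c)\, d\mu - \int_G \mathrm{ReLU}(\tilde f - \tilde c)\, d\mu \right|
  \le \int_G \left| \mathrm{ReLU}(f - c) - \mathrm{ReLU}(\tilde f - \tilde c) \right| d\mu
  \le \int_G \left| (f - c) - (\tilde f - \tilde c) \right| d\mu
  \le \int_G | f - \tilde f | \, d\mu + \mu(G) \, | c - \tilde c | .
```

If the right-hand side is smaller than half of the smallest gap between the compared integrals, the order of the perturbed integrals coincides with the order of the unperturbed ones.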
We complete the proof by showing (48). First note that, since $G$ is assumed to be first-countable, there exists a sequence of neighborhoods of $e$, denoted by $(N_i)_{i \in \mathbb{N}}$, such that and, for all neighborhoods $Z$ of $e$, there exists $k \in \mathbb{N}$ such that $N_k \subset Z$. As a consequence, we have that Indeed, assuming that there exists $g \in G$, $g \neq e$, with $g \in N_i$ for all $i \in \mathbb{N}$ yields a contradiction by invoking the Hausdorff property of $G$. Concretely, by the Hausdorff property, there exists $U$ with $e \in U$, $g \notin U$. Hence, there exists $k$ such that $N_k \subset U$, which implies $g \notin N_k$ and produces the contradiction. Since $\mu$ is finite, we conclude that $\mu(\{e\}) = 0$. Moreover, since $\mu$ is a Borel measure, we have by (49) that $\mu(N_k) \to 0$ for $k \to \infty$.
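The two measure-theoretic steps used in this paragraph can be spelled out as follows. This is a sketch under the assumptions that the neighborhoods are measurable and nested, $N_{k+1} \subset N_k$, and that $\bigcap_k N_k = \{e\}$, as the argument above indicates.

```latex
% Step 1: singletons are null sets. Left-invariance of the Haar measure gives
% \mu(\{h\}) = \mu(\{e\}) for all h \in G. Since G is infinite, for every n \in \mathbb{N}
% there are distinct h_1, \dots, h_n \in G, so that
n \, \mu(\{e\}) = \mu(\{h_1, \dots, h_n\}) \le \mu(G) < \infty \quad \text{for all } n \in \mathbb{N}
\;\Longrightarrow\; \mu(\{e\}) = 0 .
% Step 2: continuity from above of the finite measure \mu along the decreasing sequence (N_k):
\lim_{k \to \infty} \mu(N_k) = \mu\Big( \bigcap_{k \in \mathbb{N}} N_k \Big) = \mu(\{e\}) = 0 .
```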
In addition, it follows from the continuity of the multiplication that there exists

Theorem 3.5 (Lower bound on VC dimension). Let $m \in \mathbb{N}$, let $G$ be a compact group acting on $\mathcal{X}$ via an action with trivial kernel, let $G$ contain an element of order two, let $|G| \ge 2m \cdot \binom{m}{\lfloor m/2 \rfloor}$, and let $\mu$ be the Haar measure on $G$. Then, there is a bounded kernel $K : G \to \mathbb{R}$ such that $\mathrm{VCdim}(H(K)) \ge m$.
Proof. The theorem follows from the results in Subsections 3.2.1 and 3.2.2 below. Concretely, in Definition 3.8, we introduce the notion of a set of orders associated with a kernel and a set of $m \in \mathbb{N}$ functions $F$. Orders are injective maps from $[m]$ to $[m]$.

Theorem 3.6 (Lower bound on VC dimension in general case). Let $m \in \mathbb{N}$, let $G$ be a compact group acting on $\mathcal{X}$ via an action with trivial kernel, let $|G| \ge 9m \cdot \binom{m}{\lfloor m/2 \rfloor}$, and let $\mu$ be the Haar measure on $G$. Then, there is a bounded kernel $K$ such that $\mathrm{VCdim}(H(K)) \ge m$.

Lemma 3.11. Let $G$ be a group acting on $\mathcal{X}$, let $\mu$ be the Haar measure on $G$, let $K : G \to \mathbb{R}$ be a bounded kernel and let $\{f_1, \dots, f_m\} \subset B(\mathcal{X}, \mu)$. If $H(K)$ shatters $\{f_1, \dots, f_m\}$, then $O(K)$ contains at least $\binom{m}{\lfloor m/2 \rfloor}$ elements, where $\lfloor m/2 \rfloor$ denotes the integer part of $m/2$.

Lemma 3.14. Let $m \in \mathbb{N}$. There is a complete set of orders $O$ containing no more than $\binom{m}{\lfloor m/2 \rfloor}$ elements.
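To get a feeling for the sizes involved, the following minimal sketch evaluates the bound on the number of orders and the resulting group-size thresholds. It reads the quantity written with $m$ over $\lfloor m/2 \rfloor$ as the central binomial coefficient, which is an interpretation of the statements above; the function names `num_orders_bound` and `min_group_size` are hypothetical.

```python
from math import comb


def num_orders_bound(m: int) -> int:
    """Central binomial coefficient C(m, floor(m/2)), read here as the number
    of orders appearing in Lemma 3.11 and Lemma 3.14 (an interpretation)."""
    return comb(m, m // 2)


def min_group_size(m: int, factor: int = 2) -> int:
    """Group-size threshold factor * m * C(m, floor(m/2)).
    factor=2 mirrors the condition of Theorem 3.5, factor=9 that of Theorem 3.6."""
    return factor * m * num_orders_bound(m)
```

For example, `min_group_size(4)` evaluates to 48 and `min_group_size(4, factor=9)` to 216, so the required group size grows quickly with the targeted VC dimension $m$.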