How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model


The achievements of deep learning algorithms [1] are outstanding. These methods exhibit superhuman performance in areas ranging from image recognition [2] to Go playing [3], and large language models such as GPT-4 [4] can generate unexpectedly sophisticated levels of reasoning. However, despite these accomplishments, we still lack a fundamental understanding of the underlying factors. Indeed, Go configurations, images, and patches of text lie in high-dimensional spaces, which are hard to sample due to the curse of dimensionality [5]: the distance δ between neighboring data points decreases very slowly with their number P, as δ = O(P^{-1/d}), where d is the space dimension. A generic task such as regression of a continuous function [6] requires a small δ for high performance, implying that P must be exponential in the dimension d. Such a number of data is unrealistically large: for example, the benchmark dataset ImageNet [7], whose effective dimension is estimated to be ≈ 50 [8], consists of only ≈ 10^7 data, vastly smaller than e^{50} ≈ 10^{20}. This immense gap suggests that learnable data are highly structured, and that deep networks exploit this structure by building increasingly abstract representations of the input, layer by layer. Several quantities have been proposed to characterize these representations: (i) their mutual information with the input [12, 13], (ii) their intrinsic dimensionality [14, 15], and (iii) their sensitivity toward transformations that do not affect the task (e.g. smooth deformations for image classification [16, 17]); all eventually decay with the layer depth. In some cases, the magnitude of this decay correlates with performance [16]. However, these studies do not indicate how much data is required to learn such representations, and thus the task.
Here we study this question for tasks which are hierarchically compositional, arguably a key property for the learnability of real data [18][19][20][21][22][23][24][25]. To provide a concrete example, consider the picture of a dog (see Fig. 1). The image consists of several high-level features like head, body, and limbs, each composed of sub-features like ears, mouth, eyes, and nose for the head. These sub-features can be further thought of as combinations of low-level features such as edges. Recent studies have revealed that: (i) deep networks represent hierarchically compositional tasks more efficiently than shallow networks [21]; (ii) the minimal number of data that contains enough information to reconstruct such tasks is polynomial in the input dimension [24], although extracting this information remains impractical with standard optimization algorithms; (iii) correlations between the input data and the task are critical for learning [19, 26] and can be exploited by algorithms based on the iteration of clustering methods [22, 27]. While these seminal works offer important insights, they do not directly address practical settings, specifically deep convolutional neural networks (CNNs) trained using gradient descent. Consequently, we currently do not know how the hierarchically compositional structure of the task influences the sample complexity, i.e., the number of data necessary to learn the task.
In this work, we adopt the physicist's approach [28][29][30][31] of introducing a simplified model of data, which we then investigate quantitatively via a combination of theoretical arguments and numerical experiments. The task we consider, introduced in Section 1, is a multinomial classification where the class label is determined by the hierarchical composition of input features into progressively higher-level features (see Fig. 1). This model belongs to the class of generative models introduced in [22, 27], corresponding to the specific choice of random composition rules. More specifically, we consider a classification problem with n_c classes, where the class label is expressed as a hierarchy of L randomly chosen composition rules. In each rule, m distinct tuples of s adjacent low-level features are grouped together and assigned the same high-level feature taken from a finite vocabulary of size v (see Fig. 1). Then, in Section 3, we show empirically that the sample complexity P* of deep CNNs trained with gradient descent scales as n_c m^L. Furthermore, we find that P* coincides with both (a) the number of data that allows for learning a representation that is invariant to exchanging the m semantically equivalent low-level features (subsection 3.1) and (b) the size of the training set at which the correlations between low-level features and the class label become detectable (Section 4). Via (b), P* can be derived under our assumption of randomness of the composition rules.

The Random Hierarchy Model
In this section, we introduce our model task, which is a multinomial classification problem with n_c classes, where the input-output relation is compositional, hierarchical, and local. To build the dataset, we let each class label α = 1, ..., n_c generate the set of input data with label α as follows.
i) Each label generates m distinct representations consisting of s-tuples of high-level features (see Fig. 2 for an example with s = 2 and m = n_c = 3). Each of these features belongs to a finite vocabulary of size v (v = 3 in the figure), so that there are v^s possible representations and n_c m ≤ v^s. We call the assignment of m distinct s-tuples to each label a composition rule;¹

ii) Each of the v high-level features (level L) generates m distinct representations of s sub-features (level L−1), out of the v^s possible ones. Thus, m ≤ v^{s−1}. After two generations, labels are represented as s²-tuples and there are m × m^s data per class;

iii) The input data are obtained after L generations (level-1 representation), so that each datum x consists of d = s^L input features x_j. We apply one-hot encoding to the input features: each x_j is a v-dimensional sequence with one element set to 1 and the others set to 0, the index of the non-zero component representing the encoded feature.

The number of data per class is m × m^s × · · · × m^{s^{L−1}} = m^{(s^L − 1)/(s − 1)}, hence the total number of data reads

P_max = n_c m^{(s^L − 1)/(s − 1)}. (2)

A generic classification task is thus specified by L composition rules and can be represented as an s-ary tree; an example with s = 2 and L = 3 is shown in Fig. 1(c) as a binary tree. The tree representation highlights that the class label α(x) of a datum x can be written as a hierarchical composition of L local functions of s variables [20, 21]. For instance, with s = L = 2 and x = (x_1, x_2, x_3, x_4),

α(x_1, ..., x_4) = g_2(g_1(x_1, x_2), g_1(x_3, x_4)),

where g_1 and g_2 represent the 2 composition rules.
¹ Composition rules are called production rules in formal language theory [32].

Figure 1: (a) The picture of a dog can be decomposed into high-level features (head, paws), that in turn can be represented as sets of lower-level features (eyes, nose, mouth, and ear for the head). Notice that, at each level, there can be multiple combinations of low-level features giving rise to the same high-level feature. (b) A similar hierarchical structure can be found in natural language: a sentence is made of clauses, each having different parts such as subject and predicate, which in turn may consist of several words. (c) An illustration of the artificial data structure we propose. The samples reported here were drawn from an instance of the Random Hierarchy Model with depth L = 3 and tuple length s = 2. Different features are shown in different colors.
In the Random Hierarchy Model (RHM) the L composition rules are chosen uniformly at random over all the possible assignments of m low-level representations to each high-level feature. As sketched in Fig. 2, this random choice induces correlations between low- and high-level features. In simple terms, each of the high-level features (1, 2 or 3 in the figure) is more likely to be represented with a certain low-level feature in a given position: blue on the left for 1, yellow for 2 and green for 3. These correlations are crucial for our predictions and are analyzed in detail in Appendix B.
Let us remark that the L composition rules can be chosen such that the low-level features are homogeneously distributed across high-level features for all positions, as sketched in Fig. 3. We refer to this choice as the Homogeneous Features Model. In this model, none of the low-level features is predictive of the high-level feature. With s = 2 and Boolean features v = m = 2, the Homogeneous Features Model reduces to the problem of learning a parity function [33].
Finally, note that we only consider the case where the parameters s, m and v are constant across the hierarchy levels for clarity of exposition. It is straightforward to extend the model, together with the ensuing conclusions, to the case where the levels of the hierarchy have different parameters.

Figure 2: Example of first-level composition rules of the RHM with n_c = 3 classes, m = 3 representations per class, s = 2 and vocabulary size v = 3 (blue, orange and green). Iterating this mapping L times, with the lower-level features playing the role of high-level features at the next step, yields the full dataset. Notice that some features appear more often in the representations of a certain class than in those of the others, e.g. blue on the left appears twice in class 1, once in class 2 and never in class 3. As a result, low-level features are generally correlated with the label.

Figure 3: Example of composition rules of the Homogeneous Features Model. In contrast with the case illustrated in Fig. 2, this mapping is such that each of the 3 possible low-level features appears exactly once in each of the 2 elements of the representation of each class. In formulae, denoting with N_i(µ; α) the number of times that the low-level feature µ appears in the i-th position of the representation of class α, one has N_i(µ; α) = 1 for all i = 1, 2, for all µ ∈ {green, blue, orange} and for all α = 1, 2, 3.
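As a concrete reference, the following Python sketch generates one instance of the model along the lines of the construction above: it samples L random composition rules (disjoint tuples per high-level feature, so that rules are unambiguous) and expands a class label into its s^L low-level features. This is a minimal illustration under the assumption n_c ≤ v, not the authors' released code.

```python
import itertools
import random

import numpy as np


def sample_rules(v, s, m, L, seed=0):
    """Sample L random composition rules: each high-level feature (0..v-1)
    gets m distinct s-tuples of lower-level features, drawn without
    replacement from the v**s possible tuples."""
    rng = random.Random(seed)
    rules = []
    for _ in range(L):
        tuples = list(itertools.product(range(v), repeat=s))
        rng.shuffle(tuples)
        assert m * v <= len(tuples)
        rule = {mu: tuples[mu * m:(mu + 1) * m] for mu in range(v)}
        rules.append(rule)
    return rules


def sample_datum(label, rules, rng):
    """Expand a class label into its s**L low-level input features."""
    features = [label]
    for rule in reversed(rules):          # from the top level down to the input
        features = [sym for f in features for sym in rng.choice(rule[f])]
    return np.array(features)


if __name__ == "__main__":
    v, s, m, L, n_c = 8, 2, 8, 3, 8       # illustrative parameters, n_c <= v
    rules = sample_rules(v, s, m, L)
    rng = random.Random(1)
    x = sample_datum(label=3, rules=rules, rng=rng)
    print(x.shape)                        # (s**L,) = (8,)
    one_hot = np.eye(v)[x]                # one-hot input of size s**L x v
    print(one_hot.shape)
```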

Characteristic Sample Sizes
The main focus of our work is the answer to the following question: Q: How much data is required to learn a typical instance of the Random Hierarchy Model?
In this section, we first discuss two characteristic scales of the number of training data for an RHM with n c classes, vocabulary size v, multiplicity m, depth L, and tuple size s. The first, related to the curse of dimensionality, represents the sample complexity of methods that are not able to learn the hierarchical structure of the data. The second, which comes from information-theoretic considerations, represents the minimal number of data necessary to reconstruct an instance of the RHM. These two sample sizes can be thought of as an upper and lower bound to the sample complexity of deep CNNs, which indeed lies between the two bounds (cf. Section 3).

Curse of Dimensionality (P_max)
Let us recall that the curse of dimensionality predicts an exponential growth of the sample complexity with the input dimension d = s^L. Fig. 4 shows the test error of a one-hidden-layer fully-connected network trained on instances of the RHM while varying the number of training data P (see Materials and Methods for details of the training procedure) in the maximal-m case, m = v^{s−1}. The bottom panel demonstrates that the sample complexity is proportional to the total dataset size P_max. Since, from Eq. 2, P_max grows exponentially with d, we conclude that shallow fully-connected networks suffer from the curse of dimensionality. By contrast, we will see that using CNNs results in a much gentler (i.e. polynomial) growth of the sample complexity with d.

Figure 4: Test error of a one-hidden-layer fully-connected network as a function of the number of training points P.

Information-Theoretic Limit (P_min)
An algorithm with full prior knowledge of the generative model can reconstruct an instance of the RHM with a number of points P_min ≪ P_max. For instance, we can consider an extensive search within the hypothesis class of all possible hierarchical models with fixed n_c, m, v, and L. Then, if n_c = v and m = v^{s−1}, so that the model generates all possible input data, we can use a classical result of the PAC (Probably Approximately Correct) framework of statistical learning theory [34] to relate P_min to the logarithm of the cardinality of the hypothesis class, that is, the number of possible instances of the hierarchical model. The number of possible composition rules equals the number of ways of allocating v^{s−1} of the v^s possible tuples to each of the v classes/features, i.e. a multinomial coefficient,

v^s! / (v^{s−1}!)^v.

Since an instance consists of L independently chosen composition rules, we have

(number of instances) = (v!)^{1−L} [v^s! / (v^{s−1}!)^v]^L,

where the additional multiplicative factor (v!)^{1−L} takes into account that the input-label mapping is invariant under relabeling of the features of the L − 1 internal representations. Upon taking the logarithm and approximating the factorials for large v via Stirling's formula,

P_min ≃ log[(v!)^{1−L} (v^s! / (v^{s−1}!)^v)^L] ≈ L v^s log v.

Intuitively, the problem boils down to understanding the L composition rules, each needing m × v examples (v^s for m = v^{s−1}). P_min grows only linearly with the depth L, hence logarithmically in d, whereas P_max is exponential in d. Having used full knowledge of the generative model, P_min can be thought of as a lower bound for the sample complexity of a generic supervised learning algorithm that is agnostic of the data structure.
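For concreteness, the short snippet below compares the three characteristic scales for one illustrative choice of parameters; the expressions are those of Eq. 2, of the Stirling estimate above (maximal case), and of Eq. 7 derived in Section 3.

```python
import math

def p_max(n_c, m, s, L):
    # total number of distinct data, Eq. 2 (exponent is always an integer)
    return n_c * m ** ((s ** L - 1) // (s - 1))

def p_min(v, s, L):
    # information-theoretic estimate, maximal case n_c = v, m = v**(s-1)
    return L * v ** s * math.log(v)

def p_star(n_c, m, L):
    # empirical sample complexity of deep CNNs, Eq. 7
    return n_c * m ** L

v = n_c = 8
s, L = 2, 3
m = v ** (s - 1)                       # maximal-m case
print(f"P_max ~ {p_max(n_c, m, s, L):.3g}")   # ~ 1.7e7
print(f"P_min ~ {p_min(v, s, L):.3g}")        # ~ 4e2
print(f"P*    ~ {p_star(n_c, m, L):.3g}")     # ~ 4.1e3
```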

Sample Complexity of Deep CNNs
In this section, we focus on deep learning methods. In particular, we ask Q: How much data is required to learn a typical instance of the Random Hierarchy Model with a deep CNN?
Thus, after generating an instance of the RHM with fixed parameters n c , s, m, v, and L, we train a deep CNN with L hidden layers, filter size and stride equal to s (see Fig. 5 for an illustration) with stochastic gradient descent (SGD) on P training points selected at random among the RHM data. Further details of the training are in Materials and Methods.
By looking at the test error of trained networks as a function of the training set size (top panels of Fig. 6 and Fig. 7; see also Fig. 15 in Appendix G for a study with varying n_c), we notice the existence of a characteristic value of P at which the error decreases dramatically, i.e. the task is learned. In order to study the behavior of this threshold with the parameters of the RHM, we define the sample complexity P* as the smallest P such that the test error ε(P) is smaller than ε_rand/10, with ε_rand = 1 − n_c^{−1} denoting the average error when choosing the label uniformly at random. The bottom panels of Fig. 6 (for the case n_c = m = v) and Fig. 7 (for m < v; see Appendix G for varying n_c) show that the sample complexity scales as

P* = n_c m^L, (7)

independently of the vocabulary size v. Eq. 7 shows that deep CNNs only require a number of samples that scales as a power of the input dimension d = s^L to learn the RHM: the curse of dimensionality is beaten. This evidences the ability of CNNs to harness the hierarchical compositionality inherent to the task. The question then becomes: what mechanisms do these networks employ to achieve this feat?

Figure 5: Neural network architecture that matches the RHM hierarchy. This is a deep CNN with L hidden layers, and stride and filter size equal to the tuple length s. Filters that act on different input patches are the same (weight sharing); each hidden layer has multiple channels followed by a non-linearity. The number of input channels equals v and the output is a vector of size n_c.

Emergence of Synonymic Invariance in Deep CNNs
A natural approach to learning the RHM would be to identify the sets of s-tuples of input features that correspond to the same higher-level feature. Examples include the pairs of low-level features in Fig. 2 and Fig. 3 which belong to the same column. In general, we refer to s-tuples that share the same higher-level representation as synonyms. Identifying synonyms at the first level would allow us to replace each s-dimensional patch of the input with a single symbol, reducing the dimensionality of the problem from s^L to s^{L−1}. Repeating this procedure L times would lead to the class labels and, consequently, to the solution of the task.
In order to test whether deep CNNs trained on the RHM resort to a similar solution, we introduce the synonymic sensitivity, a measure of the invariance of any given function of the input with respect to the exchange of synonymic s-tuples. We define S_{k,l} as the sensitivity of the k-th layer representation of a trained network with respect to exchanges of synonymous tuples of level-l features. Namely,

S_{k,l} = ⟨‖f_k(x) − f_k(P_l x)‖⟩_{x, P_l} / ⟨‖f_k(x) − f_k(z)‖⟩_{x, z}, (8)

where: f_k is the vector of the activations of the k-th layer of the network; P_l is an operator that replaces all the level-l tuples with synonyms selected uniformly at random; ⟨·⟩ with subscripts x, z denotes an average over all the inputs of an instance of the RHM; and the subscript P_l denotes an average over all the exchanges of synonyms. In particular, S_{k,1} quantifies the invariance of the hidden representation learned by the network at layer k with respect to exchanges of synonymic tuples of input features. Fig. 8 reports S_{2,1} as a function of the training set size P for different combinations of the model parameters. We focus on S_{2,1}, the sensitivity of the second layer of the deep CNN to permutations at the first level of the hierarchy, since invariance to level-l exchanges can generally be achieved at all layers k ≥ l + 1, and not before. Notice that all curves display a sigmoidal shape, signaling the existence of a characteristic sample size which marks the emergence of synonymic invariance in the learned representations. Remarkably, by rescaling the x-axis by the sample complexity of Eq. 7 (bottom panel), curves corresponding to different parameters collapse. We conclude that the generalization ability of the network relies on the synonymic invariance of its hidden representations.
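A possible implementation of this measure is sketched below. The helpers `f_k` (returning the k-th layer activations) and `synonym_permute` (applying the operator P_l using the known RHM rules) are hypothetical names assumed here for illustration; the denominator is estimated with a random pairing of the inputs.

```python
import torch

def synonymic_sensitivity(f_k, x, synonym_permute, n_perm=8):
    """Estimate S_{k,l} as defined in Eq. 8.

    f_k: callable returning the k-th layer activations for a batch of inputs.
    synonym_permute: callable implementing the operator P_l, i.e. replacing
        every level-l tuple of each input with a randomly chosen synonym.
    """
    with torch.no_grad():
        base = f_k(x).flatten(1)                       # f_k(x), one row per input
        # numerator: change of the representation under synonym exchanges
        num = torch.stack([
            (f_k(synonym_permute(x)).flatten(1) - base).norm(dim=1).mean()
            for _ in range(n_perm)                     # average over exchanges P_l
        ]).mean()
        # denominator: typical distance between representations of distinct
        # inputs, estimated here with a random pairing of the batch
        perm = torch.randperm(x.shape[0])
        den = (f_k(x[perm]).flatten(1) - base).norm(dim=1).mean()
    return (num / den).item()
```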
Measures of the synonymic sensitivity S_{k,1} for different layers k are reported in Fig. 9 (blue lines), showing that small values of S_{k,1} are indeed achieved for k ≥ 2. Fig. 9 also shows the sensitivities to exchanges of synonyms in the higher-level representations of the RHM: all levels are learned together as P increases, and invariance to level-l exchanges is achieved at layer k = l + 1, as expected. The figure also displays the test error (gray dashed), to further emphasize its correlation with synonymic invariance.
Figure 8: Sensitivity S_{2,1} of the second layer of a deep CNN to permutations at the first level of the RHM with L = 2, 3, s = 2, n_c = m = v, as a function of the training set size (top) and after rescaling the x-axis by P* = n_c m^L (bottom). The sensitivity decreases from 1 to approximately zero, i.e. deep CNNs are able to learn synonymic invariance with enough training points. The collapse after rescaling highlights that this can be done with P* training points.

Correlations Govern Synonymic Invariance
We now provide a theoretical argument for understanding the scaling of P* in Eq. 7 with the parameters of the RHM. First, we compute a third characteristic sample size P_c, defined as the size of the training set at which the local correlations between any of the input patches and the label become detectable. Remarkably, P_c coincides with P* of Eq. 7. Secondly, we demonstrate how a one-hidden-layer neural network acting on a single patch can use such correlations to build a synonymic-invariant representation in a single step of gradient descent, so that P_c and P* also correspond to the emergence of an invariant representation.

Identify Synonyms by Counting
The invariance of the RHM labels with respect to exchanges of synonymous input patches can be inferred by counting the occurrences of such patches in all the data belonging to a given class α. Intuitively, tuples of features that appear with identical frequencies are likely synonyms.

Figure 9: Permutation sensitivity S_{k,l} of the layers of a deep CNN trained on the RHM with L = 3, s = 2, n_c = m = v = 8, as a function of the training set size P. The permutation of synonyms is performed at different levels, as indicated in colors. The different panels correspond to the sensitivity of different layers' activations, indicated by the gray box. Synonymic invariance is learned at the same time for all layers, and most of the invariance to level l is obtained at layer k = l + 1.
More specifically, let us denote an s-dimensional input patch by x_j, for j = 1, ..., s^{L−1}, an s-tuple of input features by µ = (µ_1, ..., µ_s), and the number of data in class α which display µ in the j-th patch by N_j(µ; α). Normalizing this number by N_j(µ) = Σ_α N_j(µ; α) yields the conditional probability f_j(α|µ) for a datum to belong to class α conditioned on displaying the s-tuple µ in the j-th input patch,

f_j(α|µ) = N_j(µ; α) / N_j(µ).

If the low-level features are homogeneously spread across classes, as in the Homogeneous Features Model of Fig. 3, then f_j(α|µ) = n_c^{−1}, independently of α, µ and j. In contrast, due to the aforementioned correlations, the conditional probabilities of the RHM all differ from n_c^{−1} (see Fig. 2). We refer to this difference as the signal. Distinct level-1 tuples µ and ν yield a different f (and thus a different signal) with high probability, unless they share the same level-2 representation. Therefore, this signal can be used to identify synonymous level-1 tuples.
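The empirical version of these frequencies can be accumulated with a simple counting pass over the training set, as in the sketch below (a hypothetical helper, assuming the integer encoding of features and labels used in the generation sketch of Section 1).

```python
import numpy as np

def empirical_class_frequencies(X, y, n_c, v, s, j):
    """Estimate the empirical conditional probabilities f_j(alpha | mu).

    X: integer array of shape (P, s**L) with the low-level features,
    y: integer array of shape (P,) with the class labels,
    j: index of the s-dimensional patch (0 <= j < s**(L-1)).
    Returns an array of shape (v**s, n_c).
    """
    patch = X[:, j * s:(j + 1) * s]                  # features in patch j
    # encode each s-tuple mu as a single integer in [0, v**s)
    mu = patch @ (v ** np.arange(s))
    counts = np.zeros((v ** s, n_c))                 # empirical N_j(mu; alpha)
    np.add.at(counts, (mu, y), 1)
    totals = counts.sum(axis=1, keepdims=True)       # empirical N_j(mu)
    with np.errstate(invalid="ignore", divide="ignore"):
        f_hat = counts / totals                      # empirical f_j(alpha | mu)
    return np.nan_to_num(f_hat, nan=0.0)
```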

Signal vs. Sampling Noise
When measuring the conditional class probabilities with only P training data, the occurrences in the right-hand side of Eq. 24 are replaced with empirical occurrences, which induce a sampling noise on the f's. For the identification of synonyms to be possible, this noise must be smaller in magnitude than the aforementioned signal; a visual comparison between signal and noise is depicted in Fig. 10.

Figure 10: Signal vs. noise illustration. The dashed curve represents the distribution of f(α|µ) resulting from the random sampling of the RHM rules. The solid dots illustrate the true frequencies f(α|µ) sampled from this distribution, with different colors corresponding to different groups of synonyms. The typical spacing between the solid dots, given by the width of the distribution, represents the signal. Transparent dots represent the empirical frequencies f̂_j(α|µ), with dots of the same color corresponding to synonymous features. The spread of transparent dots of the same color, which is due to the finiteness of the training set, represents the noise.
The magnitude of the signal can be computed as the ratio between the standard deviation and the mean of f_j(α|µ) over realizations of the RHM. The full calculation is presented in Appendix B; here we present a simplified argument based on an additional independence assumption. Given a class α, the tuple µ appearing in the j-th input patch is determined by a sequence of L choices (one choice per level of the hierarchy) of one among m possible lower-level representations. These m^L possibilities lead to all the mv distinct s-tuples. N_j(µ; α) is proportional to how often the tuple µ is chosen, i.e. m^L/(mv) times on average. Under the assumption of independence of the m^L choices, the fluctuations of N_j(µ; α) relative to its mean are given by the central limit theorem and read (m^L/(mv))^{−1/2} in the limit of large m. If n_c is sufficiently large, the fluctuations of N_j(µ) are negligible in comparison. Therefore, the relative fluctuations of f_j are the same as those of N_j(µ; α): the size of the signal is (m^L/(mv))^{−1/2}.
The magnitude of the noise is given by the ratio between the standard deviation and the mean, over independent samplings of a training set of fixed size P, of the empirical conditional probabilities f̂_j(α|µ). Only P/(n_c m v) of the training points will, on average, belong to class α while displaying the tuple µ in the j-th patch. Therefore, by the convergence of the empirical measure to the true probability, the sampling fluctuations of f̂ relative to the mean are of order [P/(n_c m v)]^{−1/2}; see Appendix B for details.
Balancing signal and noise yields the characteristic sample size P_c for the emergence of correlations. For large m, n_c and P,

P_c = n_c m^L,

which coincides with the empirical sample complexity of deep CNNs discussed in Section 3.
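Explicitly, the balance between the two scales computed above reads

```latex
\underbrace{\sqrt{\frac{n_c\, m\, v}{P}}}_{\text{noise}}
\;\sim\;
\underbrace{\sqrt{\frac{m\, v}{m^{L}}}}_{\text{signal}}
\quad\Longrightarrow\quad
P_c \sim n_c\, m^{L},
```

which reproduces the empirical scaling of Eq. 7.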

Learning Synonymic Invariance by the Gradients
To complete the argument, we consider a simplified one-step gradient descent setting [35, 36], where P_c marks the number of training examples required to learn a synonymic-invariant representation. In this setting (details presented in Appendix C), we train a one-hidden-layer fully-connected network on the first s-dimensional patch of the data. This network cannot fit data which have the same features in the first patch while belonging to different classes. Nevertheless, the hidden representation of the network can become invariant to exchanges of synonymous patches.
More specifically, as we show in Appendix C, with identical initialization of the hidden weights and orthogonalized inputs, the update of the hidden representation f_h(µ) of the s-tuple of low-level features µ after one step of gradient descent follows

Δf_h(µ) ∝ Σ_α a_{h,α} [ N̂_1(µ; α) − N̂_1(µ)/n_c ],

where f_h(µ) coincides with the pre-activation of the h-th neuron and a_h = (a_{h,1}, ..., a_{h,n_c}) denotes the associated n_c-dimensional readout weight; N̂_1 denotes the empirical estimate of the occurrences in the first input patch. Hence, by the result of the previous section, the hidden representation becomes insensitive to the exchange of synonymic features for P ≫ P_c. This prediction is confirmed empirically in Fig. 11, which shows the sensitivity S_{1,1} of the hidden representation of shallow fully-connected networks trained in the setting of this section, as a function of the number P of training data for different combinations of the model parameters. The bottom panel, in particular, highlights that the sensitivity is close to 1 for P ≪ P_c and close to 0 for P ≫ P_c. In addition, notice that the collapse of the pre-activations of synonymic tuples onto the same, synonymic-invariant value implies that the rank of the hidden weight matrix tends to v, the vocabulary size of higher-level features. This low-rank structure is typical of the weights of deep networks trained on image classification [37][38][39][40].
Using all patches via weight sharing. Notice that using a one-hidden-layer CNN which looks at all patches via weight sharing and global average pooling would yield the same result, since the average over patches reduces both the signal and the noise by the same factor; see subsection C.1 for details.

Figure 11: Synonymic sensitivity of the hidden representation vs. P for a one-hidden-layer fully-connected network trained on the first patch of the inputs of an RHM with s = 2 and m = v, for several values of L, v, and n_c ≤ v. The top panel shows the bare curves whereas, in the bottom panel, the x-axis is rescaled by P_c = n_c m^L. The collapse of the rescaled curves highlights that P_c coincides with the threshold number of training data for building a synonymic-invariant representation.
Improved Performance via Clustering. Note that our signal-vs-noise argument is based on a single class α, as it considers the scalar quantity f̂(α|µ). However, an observer seeking to identify synonyms could in principle use the information from all classes, represented by the n_c-dimensional vector of empirical frequencies (f̂(α|µ))_{α=1,...,n_c}. Following this idea, one can devise a layer-wise algorithm where the representations of each layer are first updated with a single step of gradient descent (as in Eq. 82), then clustered into synonymic groups [22, 27]. Such an algorithm can solve the RHM with fewer than n_c m^L training points, namely √(n_c) m^L in the maximal dataset case n_c = v and m = v^{s−1}, as we show empirically and justify theoretically in Appendix D. Notably, the dependence on m^L is unaffected by the change of algorithm, although the prefactor reveals the advantage of the dedicated clustering algorithm over standard CNNs.
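A minimal sketch of one such layer-wise step is given below: it builds a one-step-GD-like representation of each first-patch s-tuple from the empirical counts and groups the tuples with k-means. The helper name, the use of scikit-learn, and the assumption that the number of groups (the level-2 vocabulary size v) is known are illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_synonyms(X, y, n_c, v, s, n_groups, seed=0):
    """One layer-wise step: count-based representation + k-means clustering.

    Builds, for every s-tuple mu observed in the first patch, the vector of
    centred empirical counts N_1(mu; alpha) - N_1(mu)/n_c, then clusters
    these vectors into n_groups groups of putative synonyms.
    """
    mu = X[:, :s] @ (v ** np.arange(s))        # encode first-patch s-tuples
    counts = np.zeros((v ** s, n_c))
    np.add.at(counts, (mu, y), 1)              # empirical N_1(mu; alpha)
    reps = counts - counts.sum(1, keepdims=True) / n_c
    seen = counts.sum(1) > 0                   # only tuples present in the data
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=seed)
    labels = np.full(v ** s, -1)
    labels[seen] = km.fit_predict(reps[seen])
    return labels                              # cluster index per s-tuple
```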

Conclusions
We have introduced a hierarchical model of classification tasks, where each class is identified by a number of equivalent high-level features (synonyms), themselves consisting of a number of equivalent sub-features, according to a hierarchy of random composition rules. First, we established via a combination of extensive experiments and theoretical arguments that the sample complexity of deep CNNs is a simple function of the number of classes n_c, the number of synonymic features m and the depth of the hierarchy L. This result provides a rule of thumb for estimating the order of magnitude of the sample complexity of real datasets. In the case of CIFAR10 [41], for instance, which has 10 classes, taking reasonable values for the RHM parameters such as m ∈ [5, 15] and L = 3 yields P* ∈ [10^3, 3×10^4], comparable with the sample complexity of modern architectures (see Fig. 16 in Appendix G).
Secondly, our results indicate a separation between shallow networks, which are cursed by the input dimensionality, and sufficiently deep CNNs, which are not. We thus complement previous analyses based on expressivity [21] or information-theoretical considerations [24] with a generalization result.
Last but not least, we proposed to characterize the quality of internal representations via their sensitivity toward transformations of the data which leave the task invariant. This analysis bypasses the issues of previous characterizations. For example, approaches based on mutual information [12] are ill-defined when the network representations are deterministic functions of the input [13], and approaches based on intrinsic dimension [14, 15] can display counter-intuitive results; see Appendix F for a more in-depth discussion of the intrinsic dimension and of how this quantity behaves in our setup. Interestingly, our approach suggests that performance should strongly correlate with the invariance of the internal representations toward synonyms. This prediction could in principle be tested in natural language processing models, but also on image datasets, by performing discrete changes to images that leave the class unchanged.
Looking forward, the Random Hierarchy Model is a rich but minimal model in which open questions in the theory of deep learning could be clarified. For instance, a formidable challenge such as the description of the gradient-descent dynamics of deep networks becomes significantly simpler for the RHM, owing to the simple structure of the target representations. Other important questions, including the ability of fully-connected networks to learn local connections [30, 42, 43], the benefits of residual connections [44], and the advantages of deep learning over kernel methods [25, 45-47], can be studied quantitatively within this model, as functions of the multiple parameters that define the hierarchical structure of the task.

Experimental Setup
The experiments are performed using the PyTorch deep learning framework [48]. The code used for the experiments is available online at https://github.com/pcsl-epfl/hierarchy-learning.
The inputs sampled from the RHM are represented as a one-hot encoding of low-level features. This makes each input of size s^L × v. The inputs are whitened so that the average pixel value over channels is equal to zero.

Model Architecture
One-hidden-layer fully-connected networks have input dimension equal to s^L × v, H = 10^4 hidden neurons, and n_c outputs. The deep convolutional neural networks (CNNs) have weight sharing, filter size and stride equal to s, and L hidden layers. In this case, we set the width H to be larger than the number of possible s-tuples that can appear at a given layer, H ≫ v^s.
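A possible PyTorch realization of the deep CNN described above (a sketch consistent with the stated architecture; the width and example parameters are illustrative, not necessarily those of the released code) is:

```python
import torch
import torch.nn as nn

class HierarchicalCNN(nn.Module):
    """Deep CNN matching the RHM tree: L hidden layers, filter size = stride = s."""

    def __init__(self, v, n_c, s, L, width):
        super().__init__()
        layers, in_ch = [], v
        for _ in range(L):
            layers += [nn.Conv1d(in_ch, width, kernel_size=s, stride=s), nn.ReLU()]
            in_ch = width
        self.features = nn.Sequential(*layers)
        self.readout = nn.Linear(width, n_c)

    def forward(self, x):
        # x: one-hot input of shape (batch, v, s**L)
        h = self.features(x)              # shape (batch, width, 1) after L layers
        return self.readout(h.flatten(1))

# example: inputs of an RHM with v = 8, s = 2, L = 3; width 256 >> v**s = 64
model = HierarchicalCNN(v=8, n_c=8, s=2, L=3, width=256)
x = torch.nn.functional.one_hot(torch.randint(0, 8, (4, 8)), num_classes=8)
out = model(x.float().transpose(1, 2))   # output of shape (4, n_c)
```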

Training Procedure
Neural networks are trained using stochastic gradient descent (SGD) on the cross-entropy loss, with a batch size of 128 and a learning rate of 0.3. Training is stopped when the training loss decreases below a threshold fixed to 10^{-3}.
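The corresponding training loop can be sketched as follows (hyperparameters as stated above; the construction of the training tensors is assumed and the stopping rule is a simplified reading of the description).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(model, x_train, y_train, lr=0.3, batch_size=128, loss_threshold=1e-3):
    loader = DataLoader(TensorDataset(x_train, y_train),
                        batch_size=batch_size, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    while True:
        running = 0.0
        for xb, yb in loader:
            opt.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            opt.step()
            running += loss.item() * xb.shape[0]
        if running / len(x_train) < loss_threshold:   # stop below the threshold
            return model
```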

Measurements
The performance of the models is measured as the percentage error on a test set. The test set size is chosen to be min(P_max − P, 20 000). Synonymic sensitivity, as defined in Eq. 8, is measured on a test set of size min(P_max − P, 1 000). Reported results for a given value of the RHM parameters are averaged over 10 different instances of the RHM and of the network initialization.

References
[30] Alessandro Ingrosso and Sebastian Goldt. Data-driven emergence of convolutional structure in neural networks. Proceedings of the National Academy of Sciences, 119(40), 2022.

[31] Yu Feng and Yuhai Tu. The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima. Proceedings of the National Academy of Sciences, 118(9), 2021.

Appendix A Statistics of The Composition Rules
In this section, we consider a single composition rule, that is the assignment of m s-tuples of low-level features to each of the v high-level features. In the RHM these rules are chosen uniformly at random over all the possible rules, thus their statistics are crucial in determining the correlations between the input features and the class label.

A.1 Statistics of a single rule
For each rule, we call N_i(µ_1; µ_2) the number of occurrences of the low-level feature µ_1 in position i of the s-tuples generated by the higher-level feature µ_2. The probability of N_i(µ_1; µ_2) is that of the number of successes when drawing m times (the number of s-tuples associated with the high-level feature µ_2) without replacement from a pool of v^s objects (the total number of s-tuples with vocabulary size v), of which only v^{s−1} satisfy a certain condition (the number of s-tuples displaying feature µ_1 in position i):

Pr{N_i(µ_1; µ_2) = k} = \binom{v^{s−1}}{k} \binom{v^s − v^{s−1}}{m − k} / \binom{v^s}{m},

which is a hypergeometric distribution Hg_{v^s, v^{s−1}, m}, with mean and variance

⟨N_i(µ_1; µ_2)⟩ = m/v,   σ²_N = (m/v)(1 − 1/v)(v^s − m)/(v^s − 1),

independently of the position i and of the specific low- and high-level features. Notice that, since m ≤ v^{s−1} with s fixed, large m also implies large v.
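These moments can be checked numerically by sampling rules uniformly at random, as in the short script below (an illustrative sanity check under the sampling scheme described above).

```python
import random
import numpy as np

def occurrences(v, s, m, n_rules=20000, seed=0):
    """Sample N_i(mu1; mu2) for one fixed position and feature pair."""
    rng = random.Random(seed)
    tuples = [tuple((t // v ** k) % v for k in range(s)) for t in range(v ** s)]
    counts = []
    for _ in range(n_rules):
        chosen = rng.sample(tuples, m)                # m tuples assigned to mu2
        counts.append(sum(t[0] == 0 for t in chosen))  # feature mu1 = 0 at i = 0
    return np.array(counts)

v, s, m = 8, 2, 8
N = occurrences(v, s, m)
print(N.mean(), m / v)                                          # both ~ 1.0
print(N.var(), (m / v) * (1 - 1 / v) * (v**s - m) / (v**s - 1))  # both ~ 0.78
```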

A.2 Joint statistics of a single rule
Shared high-level feature. For a fixed high-level feature µ_2, the joint probability of the occurrences of two different low-level features µ_1 and ν_1 is a multivariate hypergeometric distribution, giving the following covariance:

c_N := Cov[N_i(µ_1; µ_2), N_i(ν_1; µ_2)] = −(m/v²)(v^s − m)/(v^s − 1) = −σ²_N/(v − 1).

The covariance can also be obtained via the constraint Σ_{µ_1} N_i(µ_1; µ_2) = m. For any finite sequence of identically distributed random variables X_µ, µ = 1, ..., v, with a constraint on the sum,

0 = Var[Σ_µ X_µ] = Σ_µ Var[X_µ] + Σ_{µ≠ν} Cov[X_µ, X_ν] = v (Var[X] + (v − 1) Cov[X_µ, X_ν]).

In the last line, we used the identically-distributed-variables hypothesis to replace the sum over µ ≠ ν with the factor (v − 1). Therefore,

Cov[X_µ, X_ν] = −Var[X]/(v − 1), hence c_N = −σ²_N/(v − 1).

Shared low-level feature. The joint probability of the occurrences of the same low-level feature µ_1 starting from different high-level features µ_2 ≠ ν_2 can be written analogously, since the s-tuples assigned to µ_2 and ν_2 are drawn without replacement from the same pool of v^s tuples, resulting in the following 'inter-feature' covariance:

c_if := Cov[N_i(µ_1; µ_2), N_i(µ_1; ν_2)] = −m²(v − 1)/(v²(v^s − 1)).

No shared features. Finally, by multiplying both sides of Σ_{µ_1} N_i(µ_1; µ_2) = m by N_i(ν_1; ν_2) and averaging, we get

c_g := Cov[N_i(µ_1; µ_2), N_i(ν_1; ν_2)] = −c_if/(v − 1) = m²/(v²(v^s − 1)).

B Emergence of input-output correlations (P_c)
As discussed in the main text, the Random Hierarchy Model presents a characteristic sample size P_c corresponding to the emergence of the input-output correlations. This sample size predicts the sample complexity of deep CNNs, as we also discuss in the main text. In this appendix, we prove that, for large m and n_c,

P_c = n_c m^L.

B.1 Estimating the Signal
The correlations between input features and the class label can be quantified via the conditional probability (over realizations of the RHM) of a data point belonging to class α conditioned on displaying the s-tuple µ in the j-th input patch,

f_j(α|µ) := Pr{x ∈ α | x_j = µ},

where the notation x_j = µ means that the elements of the patch x_j encode the tuple of features µ. We say that the low-level features are correlated with the output if f_j(α|µ) ≠ n_c^{−1}, and we define the 'signal' as the difference f_j(α|µ) − n_c^{−1}. In the following, we compute the statistics of the signal over realizations of the RHM.
Occurrence of low-level features. Let us begin by defining the joint occurrences of a class label α and a low-level feature µ_1 in a given position of the input. Using the tree representation of the model, we identify an input position with a set of L indices i_ℓ = 1, ..., s, each indicating which branch to follow when descending from the root (class label) to a given leaf (low-level feature). These joint occurrences can be computed by combining the occurrences of the single rules introduced in Appendix A. With L = 2, for instance,

N^{(1→2)}_{i_1 i_2}(µ_1; α) = m^{s−1} Σ_{µ_2} N^{(1)}_{i_1}(µ_1; µ_2) N^{(2)}_{i_2}(µ_2; α), (26)

where: i) N^{(2)}_{i_2}(µ_2; α) counts the occurrences of µ_2 in position i_2 of the level-2 representations of α, i.e. the s-tuples generated from α according to the second-layer composition rule; ii) N^{(1)}_{i_1}(µ_1; µ_2) counts the occurrences of µ_1 in position i_1 of the level-1 representations of µ_2, i.e. the s-tuples generated by µ_2 according to the composition rule of the first layer; iii) the factor m^{s−1} counts the descendants of the remaining s − 1 elements of the level-2 representation (m descendants per element); iv) the sum over µ_2 counts all the possible paths of features that lead to µ_1 from α across 2 generations.
The generalization of Eq. 26 is immediate once one takes into account that the multiplicity factor accounting for the descendants of the remaining positions at the ℓ-th generation is equal to m^{s^{ℓ−1}}/m (s^{ℓ−1} is the size of the representation at the previous level). Hence, the overall multiplicity factor after L generations is

Π_{ℓ=2}^{L} m^{s^{ℓ−1}−1} = m^{(s^L − 1)/(s − 1) − L},

so that the number of occurrences of feature µ_1 in position i_1 ... i_L of the inputs belonging to class α is

N^{(1→L)}_{i_{1→L}}(µ_1; α) = m^{(s^L − 1)/(s − 1) − L} Σ_{µ_2, ..., µ_L} N^{(1)}_{i_1}(µ_1; µ_2) N^{(2)}_{i_2}(µ_2; µ_3) ⋯ N^{(L)}_{i_L}(µ_L; α), (28)

where we used i_{1→L} as a shorthand notation for the tuple of indices i_1, i_2, ..., i_L. The same construction allows us to compute the number of occurrences of up to s − 1 features within the s-dimensional patch of the input corresponding to the path i_{2→L}. The number of occurrences of a whole s-tuple, instead, follows a slightly different rule, since there is only one level-2 feature which generates the whole s-tuple of level-1 features µ_1 = (µ_{1,1}, ..., µ_{1,s}); we call this feature g_1(µ_1), with g_1 denoting the first-layer composition rule. As a result, the sum over µ_2 in the right-hand side of Eq. 28 disappears and we are left with

N^{(1→L)}_{i_{2→L}}(µ_1; α) = m^{(s^L − 1)/(s − 1) − L} Σ_{µ_3, ..., µ_L} N^{(2)}_{i_2}(g_1(µ_1); µ_3) ⋯ N^{(L)}_{i_L}(µ_L; α). (29)

Incidentally, Eq. 29 shows that the joint occurrences of an s-tuple of low-level features µ_1 depend only on the level-2 feature corresponding to µ_1. Hence, N_{i_{2→L}}(µ_1; α) is invariant under the exchange of µ_1 with one of its synonyms, i.e. level-1 tuples ν_1 corresponding to the same level-2 feature.
Class probability conditioned on low-level observations. We can turn these numbers into probabilities by normalizing them appropriately. Upon dividing by the total occurrences of the low-level feature µ_1 independently of the class, for instance, we obtain

f^{(1→L)}_{i_{1→L}}(α|µ_1) = N^{(1→L)}_{i_{1→L}}(µ_1; α) / Σ_β N^{(1→L)}_{i_{1→L}}(µ_1; β), (30)

the conditional probability of the class of a given input, conditioned on the feature in position i_1 ... i_L being µ_1.
Let us also introduce, for convenience, the numerator U^{(L)}_{i_{1→L}}(µ_1; α) and the denominator D^{(L)}_{i_{1→L}}(µ_1) := Σ_α U^{(L)}_{i_{1→L}}(µ_1; α) of the right-hand side of Eq. 30, so that f^{(1→L)}_{i_{1→L}}(α|µ_1) = U^{(L)}/D^{(L)}.

B.1.1 Statistics of the numerator U
We now determine the first and second moments of the numerator of f^{(1→L)}_{i_{1→L}}(α|µ_1).

Level L = 1. For L = 1, U is simply the occurrence of a single production rule, U^{(1)}_{i_1}(µ_1; α) = N_{i_1}(µ_1; α), so that its mean, variance and covariance coincide with those computed in Appendix A; the relationship between variance and covariance is due to the constraint on the sum of U^{(1)} over µ_1, see Eq. 17.

Level L. In general, U^{(L)} obeys the recursion U^{(L)}_{i_{1→L}}(µ_1; α) = Σ_{µ_L} U^{(L−1)}_{i_{1→L−1}}(µ_1; µ_L) N^{(L)}_{i_L}(µ_L; α), so that the moments at level L follow from those at level L − 1.

Concentration for large m. In the large multiplicity limit m ≫ 1, the U's concentrate around their mean value. Since m ≤ v^{s−1}, large m also implies large v, thus we can proceed by setting m = q v^{s−1}, with q ∈ (0, 1], and studying the v ≫ 1 limit. From Eq. 41, the variance of U^{(L)} is the sum of three terms. The second of the three terms is always subleading with respect to the first, so we can discard it. It remains to compare the first and the third terms. For L = 2, since σ²_{U^{(1)}} = σ²_N, the first term depends on v as v^{2(s−1)−1}, whereas the third is proportional to v^{3(s−1)−2}. For L ≥ 3 the dominant scaling is that of the third term only: for L = 3 this can be shown by simply plugging the L = 2 result into the recursion, and for larger L it follows from the fact that replacing σ²_{U^{(L−1)}} in the first term with the third term of the previous step always yields a subdominant contribution. Upon dividing the variance by the squared mean, we obtain a relative fluctuation whose convergence to 0 for large v guarantees the concentration of the U's around their average over all instances of the RHM.

B.1.2 Statistics of the denominator D
Here we compute the first and second moments of the denominator D^{(L)} of f^{(1→L)}_{i_{1→L}}(α|µ_1).

Level L = 1. For L = 1, D is simply the sum over classes of the occurrences of a single production rule, D^{(1)}_{i_1}(µ_1) = Σ_α N_{i_1}(µ_1; α); its variance follows from the identities σ²_N + (v − 1)c_N = 0 from Eq. 16 and c_if + (v − 1)c_g = 0 from Eq. 22.

Level L = 2 and beyond. For L = 2, the moments follow by inserting the recursion for U^{(2)} into D^{(2)} = Σ_α U^{(2)}; the general level-L case is obtained by iterating the same argument.

Concentration for large m. Since the D's can be expressed as sums of different U's, their concentration for m ≫ 1 follows directly from that of the U's.

B.1.3 Estimate of the conditional class probability
We can now turn back to the original problem of estimating the conditional class probability f^{(1→L)}_{i_{1→L}}(α|µ_1) = U^{(L)}/D^{(L)}. Having shown that both numerator and denominator concentrate around their averages for large m, we can expand for small fluctuations around these averages and write

f^{(1→L)}_{i_{1→L}}(α|µ_1) ≈ (1/n_c) [ 1 + δU^{(L)}/Ū^{(L)} − δD^{(L)}/D̄^{(L)} ],

where δU and δD denote the fluctuations of U and D around their means Ū and D̄, and we used Ū^{(L)}/D̄^{(L)} = n_c^{−1}. Since the conditional frequencies average to n_c^{−1}, the fluctuation term in brackets averages to zero. We can then estimate the size of the fluctuations of the conditional frequencies (i.e. the 'signal') with the standard deviation of this term.
It is important to notice that, for each L and position i_{1→L}, D is the sum over α of U, and the U's with different α at fixed low-level feature µ_1 are identically distributed. In general, for a sequence of identically distributed variables (X_α)_{α=1,...,n_c}, the variance of the sum can be expressed in terms of the variance and the covariance of the single elements. In our case, this allows us to write the variance of the signal in terms of the variances and covariances of the U^{(L)}'s, where we use that Ū^{(L)} = D̄^{(L)}/n_c to convert differences of second moments into differences of variances. By Eq. 41 and Eq. 58, and using again that Ū^{(L)} = D̄^{(L)}/n_c, one obtains a recursion over the levels which, once iterated, yields the variance of the signal: for large m and n_c, σ²_f = v/m^L (Eq. 71).

B.2 Introducing sampling noise due to the finite training set

In a supervised learning setting where only P of the total data are available, the occurrences N are replaced with their empirical counterparts N̂. In particular, the empirical joint occurrence N̂(µ; α)⁷ coincides with the number of successes when sampling P points without replacement from a population of P_max in which only N(µ; α) belong to class α and display feature µ in position j. Thus, N̂(µ; α) obeys a hypergeometric distribution where P plays the role of the number of trials, P_max the population size, and the true occurrence N(µ; α) the number of favorable cases. If P is large and P_max, N(µ; α) are both much larger than P, then N̂(µ; α) converges in probability to a Gaussian random variable N(a, b) with mean a and variance b proportional to P. The statement above holds when the ratio N(µ; α)/P_max is away from 0 and 1, which is true with probability 1 for large v due to the concentration of f(α|µ). In complete analogy, the empirical occurrence N̂(µ) obeys a Gaussian limit. We obtain the empirical conditional frequency f̂(α|µ) as the ratio of Eq. 72 and Eq. 73. Since N(µ) = P_max/v and f(α|µ) = N(µ; α)/N(µ), the result can be written in terms of two correlated zero-mean, unit-variance Gaussian random variables ξ_P and ζ_P over independent drawings of the P training points. By expanding the denominator of the right-hand side for large P we get, after some algebra, Eq. 76. Recall that, in the limit of large n_c and m, f(α|µ) = n_c^{−1}(1 + σ_f ξ_RHM), where ξ_RHM is a zero-mean, unit-variance Gaussian variable over the realizations of the RHM, while σ_f is the 'signal', with σ²_f = v/m^L by Eq. 71. As a result, the fluctuations of f̂(α|µ) contain both a sampling-noise contribution and a signal contribution.

⁷ For ease of notation, we drop level and positional indices in this subsection.

B.3 Sample complexity
From Eq. 76 it is clear that, for the signal to be detectable in f̂, the fluctuations due to the sampling noise must be smaller than those due to the random choice of the composition rules. Therefore, the crossover takes place when the two noise terms have the same size, occurring at P = P_c with

P_c = n_c m^L.

C One-Step Gradient Descent (GD)

We will consider a simplified but tractable setting, where we generate an instance of the RHM and then train a one-hidden-layer fully-connected network only on the first s-dimensional patch of the input. Since there are many data having the same first s-dimensional patch but a different label, this network does not have the capacity to fit the data. Nevertheless, in the case where the s-dimensional patches are orthogonalized, neural networks can learn the synonymic invariance of the RHM if trained on at least P_c data.
GD on Cross-Entropy Loss. More specifically, let us first sample an instance of the RHM, then P input-label pairs (x_k, α_k), with α_k := α(x_k) for all k = 1, ..., P. For any datum x, we denote by µ_1(x) the s-tuple of features in the first patch and by δ_µ the one-hot encoding of the s-tuple µ (with dimension v^s). The fully-connected network acts on the one-hot encoding of the s-tuples with ReLU activations σ(x) = max(0, x),

F_NN(x) = (1/H) Σ_{h=1}^{H} a_h σ(w_h · δ_{µ_1(x)}),

where the inner-layer weights w_h have the same dimension as δ_µ and the top-layer weights a_h are n_c-dimensional. The top-layer weights are initialized as i.i.d. Gaussians with zero mean and unit variance and kept fixed. The w_h's are trained by Gradient Descent (GD) on the cross-entropy loss,

L = −Ê[ Σ_β δ_{β,α(x)} log( e^{F_{NN,β}(x)} / Σ_γ e^{F_{NN,γ}(x)} ) ],

where δ_{β,α(x)} stems from the one-hot encoding of the class label α(x) and Ê denotes expectation over the training set.
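At initialization the predictor vanishes, so the softmax is uniform over the n_c classes and the gradient of the cross-entropy loss with respect to the predictor takes a particularly simple form,

```latex
-\left.\frac{\partial \mathcal{L}}{\partial F_\beta(x)}\right|_{F=0}
= \delta_{\beta,\alpha(x)} - \frac{1}{n_c},
```

which is the source of the empirical-count structure of the weight update derived below.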
For simplicity, we consider the mean-field limit H → ∞, so that F^{(0)}_NN = 0 identically, and we initialize all the inner-layer weights to 1 (the vector with all elements set to 1).
Update of the Hidden Representation. In this setting, with enough training points, one step of gradient descent is sufficient to build a representation invariant to the exchange of synonyms. Due to the one-hot encoding, (w_h · δ_µ), namely the h-th component of the hidden representation of the s-tuple µ, coincides with the µ-th component of the weight w_h. This component, which is set to 1 at initialization, is updated by (minus) the corresponding component of the gradient of the loss in Eq. 79. Recalling that at initialization the predictor is 0 and all the components of the inner-layer weights are 1, we get

Δf_h(µ) ∝ Σ_α a_{h,α} [ N̂_1(µ; α) − N̂_1(µ)/n_c ],

where N̂_1(µ) is the empirical occurrence of the s-tuple µ in the first patch of the P training points and N̂_1(µ; α) is the (empirical) joint occurrence of the s-tuple µ and the class label α. As P increases, the empirical occurrences N̂ converge to the true occurrences N, which are invariant under the exchange of synonymous s-tuples µ. Hence, the hidden representation also becomes invariant under the exchange of synonymous s-tuples in this limit.

C.1 Extension to a one-hidden-layer CNN
The same argument can be carried out by considering a one-hidden-layer CNN with weight sharing and global average pooling,

F_CNN(x) = (1/H) Σ_{h=1}^{H} a_h (1/s^{L−1}) Σ_{j=1}^{s^{L−1}} σ(w_h · δ_{µ_j(x)}),

where we added an average over all input patches j. The gradient updates now read

Δf_h(µ) ∝ (1/s^{L−1}) Σ_j Σ_α a_{h,α} [ N̂_j(µ; α) − N̂_j(µ)/n_c ];

hence, synonymic invariance can now be inferred from the average occurrences over patches. This average results in a reduction of both the signal and the noise by the same factor √(s^{L−1}). Consequently, analogously to the case without weight sharing, the hidden representation becomes insensitive to the exchange of synonymic features for P ≫ P_c = n_c m^L.

D Improved Sample Complexity via Clustering
In Section 4.C of the main text and in Appendix C, we showed that the hidden representation of a one-hidden-layer fully-connected network trained on the first patch of the RHM inputs becomes insensitive to exchanges of synonyms at P = P* = P_c = n_c m^L. Here we consider the maximal dataset case n_c = v and m = v^{s−1}, and show that a distance-based clustering method acting on these hidden representations would identify synonyms at P ≈ √(n_c) m^L, much smaller than P_c. If µ and ν are synonyms, then g(µ) = g(ν) and only the noise term contributes to the right-hand side of Eq. 87. If this noise is sufficiently small, then the distance above can be used to cluster tuples into synonymic groups. By the independence of the noises and the Central Limit Theorem, the size of ‖η(µ) − η(ν)‖² for n_c ≫ 1 can be estimated over independent samplings of the P training points. The g's are also random variables over independent realizations of the RHM, with zero mean and variance proportional to the variance of the conditional probabilities f(α|µ) (see Eq. 62 and Eq. 71). To estimate the size of ‖g(µ) − g(ν)‖² we must take into account the correlations (over RHM realizations) between g's with different class labels and tuples. However, in the maximal dataset case n_c = v and m = v^{s−1}, both the sum over classes and the sum over tuples of input features of the joint occurrences N(µ; α) are fixed deterministically. The constraints on the sums allow us to control the correlations between occurrences of the same tuple within different classes and of different tuples within the same class, so that the size of the term ‖g(µ) − g(ν)‖² for n_c = v ≫ 1 can be estimated via the Central Limit Theorem. The mixed term (g(µ) − g(ν)) · (η(µ) − η(ν)) has zero average (both with respect to training set sampling and RHM realizations) and can also be shown to lead to relative fluctuations of order O(√n_c) in the maximal dataset case.
To sum up, for synonyms the distance between representations is set by the sampling noise alone, an O(1) random contribution ξ_P dependent on the training-set sampling. If µ and ν are not synonyms, instead, the distance also contains a deterministic part plus an O(1) contribution ξ_RHM dependent on the RHM realization. In this setting, the signal is the deterministic part of the difference between representations of non-synonymic tuples. Due to the sum over class labels, the signal is scaled up by a factor n_c, whereas the fluctuations (stemming from both sampling and model) are only increased by O(√n_c). Therefore, the signal required for clustering emerges from the sampling noise at P = P_c/√n_c = √(n_c) m^L, equal to v^{1/2 + L(s−1)} in the maximal dataset case. This prediction is tested for s = 2 in Fig. 12, which shows the error achieved by a layerwise algorithm which alternates single GD steps with clustering of the resulting representations [22, 27]. More specifically, the weights of the first hidden layer are updated with a single GD step while keeping all the other weights frozen. The resulting representations are then clustered, so as to identify groups of synonymic level-1 tuples. The centroids of the ensuing clusters, which correspond to level-2 features, are orthogonalized and used as inputs of another one-step GD protocol, which aims at identifying synonymic tuples of level-2 features. The procedure is iterated L times.
Figure 12: Sample complexity for layerwise training, m = n_c = v, L = 3, s = 2. Training of an L-layer network is performed layerwise by alternating the one-step GD described in Section 4.C and clustering of the hidden representations. Clustering of the mv = v² representations of the different one-hot-encoded input patches is performed with the k-means algorithm. Clustered representations are then orthogonalized and the result is given to the next one-step GD procedure. Left: test error vs. number of training points; different colors correspond to different values of v. Center: collapse of the test error curves when rescaling the x-axis by v^{L+1/2}. Right: the same, when rescaling the x-axis by v^{L+1}. The curves show a better collapse when rescaling by v^{L+1/2}, suggesting that these layerwise algorithms have an advantage of a factor √v over end-to-end training with deep CNNs, for which P* = v^{L+1}.

E Instances of the Homogeneous Feature Model (HFM) in the Random Hierarchy Model (RHM)
Given that the rules of the RHM are chosen uniformly at random, there is a non-zero probability that an HFM, in which no input-output correlations exist, is sampled as an instance of the RHM. In these instances, synonymic invariance, and hence good generalization, could not be learned from correlations, as illustrated in the main text. In this appendix we show that such instances of the RHM are sampled with vanishing probability as the RHM parameters grow.
E.1 Maximal m case (m = v^{s−1})

The number of rules for an RHM with L layers and generic values of m and v is given by the count of all possible assignments of m distinct s-tuples to each feature at each level. The number of HFM rules, defined by having N_i(µ; α) = v^{s−2} independently of µ in each single-layer rule, can be computed as follows. Let us look at a given feature of the previous layer, for example the symbol 1. We want to assign to it m = v^{s−1} s-tuples, where the first v^{s−2} tuples have 1 as their first symbol, the next v^{s−2} have the symbol 2, and so on up to the last v^{s−2} tuples with first symbol v. The numbers {α^{1,t}_i}_{i=1,...,v^{s−1}} are permutations of the set {1, ..., v}^{v^{s−2}} for feature 1 and location t. In these rules, there is no symbol that occurs more often than the others at a given location. Consequently, the network cannot exploit any correlation between the presence of a symbol at a given location and the label to solve the task. With regard to the other features j of the previous layer, to any of these we assign the v^{s−1} tuples (1, α^{j,1}_1, ..., α^{j,s−1}_1), ..., (v, α^{j,1}_{v^{s−1}}, ..., α^{j,s−1}_{v^{s−1}}), with the numbers {α^{j,t}_i} being the same as the α^{1,t}_i for any t, but shifted forward by (j − 1)v positions. For example, the numbers related to the 'block' of tuples with first element 1 for feature 1 will be the same as those related to the 'block' of tuples with first element 2 for feature 2. In formulae: α^{j,t}_i = α^{1,t}_{i−(j−1)v}, with i − (j − 1)v being equivalent to v^{s−1} − (v − i) + 1 (periodic boundary conditions). The number of such uncorrelated rules is just the number (v^{s−1})! of permutations of the numbers {α^{1,t}_i}_{i=1,...,v^{s−1}} for a fixed tuple location t, raised to the power of the number of positions s − 1. The fraction f_HFM of uncorrelated rules for L layers is obtained by dividing by the total number of rules. We now want to show that f_HFM vanishes for large v. Implementing the Stirling approximation in Eq. 95 yields a limit behavior for large v which vanishes for large v and large L.

E.2 Generic m case
Let us characterize the uncorrelated rules in the case of generic m. For each single-layer rule, we assign m s-tuples to each symbol of the previous layer. Consider a given symbol j. To this symbol, we assign m s-tuples of the type (α^{j,1}_1, ..., α^{j,s}_1), (α^{j,1}_2, ..., α^{j,s}_2), ..., (α^{j,1}_m, ..., α^{j,s}_m), with the m numbers (α^{j,t}_i)_i at fixed location t being a permutation of a subset of {1, ..., v}^{v^{s−2}} such that, if we call m_q the number of items in the subset equal to q ∈ {1, ..., v}, each m_q is either 0 or equal to a common positive value (m_{q_1} = m_{q_2} for all q_1, q_2 such that m_{q_1}, m_{q_2} > 0), with Σ_{q=1}^{v} m_q = m. Moreover, since each symbol q can appear at most v^{s−2} times, we have that m_q ≤ v^{s−2}. We take the m numbers (α^{j,1}_i)_{i=1,...,m} at the first location t = 1 in increasing order. Note that these tuples can be picked just once across different symbols j, thus imposing constraints on the numbers α^{j,t}_i for different features j. As in the case of m = v^{s−1} in Sec. E.1, we want to show that the probability of occurrence f_HFM of such uncorrelated rules, given by the number of these rules divided by the number of total rules, is vanishing for large v and/or large L. To count the number of uncorrelated rules, which we call #_HFM, we first count the number #_{j,t} of possible series of numbers {α^{j,t}_i}_{i=1,...,m} for a fixed feature j and position t > 1, and for a single-layer rule. In other words, we have to count the number of possible subsets of m elements of {1, ..., v}^{v^{s−2}} such that each symbol q ∈ {1, ..., v} appears m_q times, with the m_q satisfying the constraints above. We introduce the quantity v_0, which is the number of symbols q that appear 0 times in a given subset. Once we fix v_0, from the constraint Σ_{q=1}^{v} m_q = m we get that the features with m_q > 0 each appear m̄ = m/(v − v_0) times, if m/(v − v_0) is a positive integer; otherwise, there are no subsets with that v_0. The number #_{j,t} then follows by counting (i) the \binom{v}{v_0} choices of the features with 0 appearances and (ii) the m! permutations of the m numbers {α^{j,t}_i}_i. Since we are interested in proving that f_HFM is vanishing for large v and L, we upper bound it by relaxing the constraint that m/(v − v_0) ∈ N_{>0} and using that \binom{v}{v_0} ≤ v^{v/2}. Considering all the s locations, we obtain #_j, defined similarly to #_{j,t}; notice that for the first location t = 1 there is no factor m!, since we ordered the numbers α^{j,1}_i in ascending order there. If we want to count #_HFM, we have to take into account that we are sampling without replacement from the space of s-tuples, so that two different symbols j cannot share the same s-tuples. To upper bound #_HFM, we relax this constraint, hence sampling the tuples with replacement; the choice of the α^{j,t}_i is then independent between different features j, and the bound for an L-layer rule follows. We now assume m ∼ v^α, with 0 < α < s − 1, and for large v we implement the Stirling approximation⁹, where we approximated ⌈v/2⌉ with v/2. At the leading order for large v, using the fact that α + 1 < s, the probability of occurrence of a parity-like rule is vanishing for large v and L.

For v = 2 the networks can generalize to a number of training points P which scales with the total size of the dataset P_max. Increasing v, performance is very close to chance already at v = 4.

⁹ The Stirling approximation for the multinomial \binom{n}{a_1 ... a_k}, for n → ∞ and integers a_i such that Σ_{i=1}^{k} a_i = n, is given by \binom{n}{a_1 ... a_k} ∼ (2πn)^{(1/2 − k/2)} k^{n + k/2} exp{−(k/2n) Σ_{i=1}^{k} (a_i − n/k)²}. In our case n = v^s, k = v + 1, a_i = m for i = 1, ..., v and a_{v+1} = v^s − vm.

F Intrinsic Dimensionality of Data Representations
In deep learning, the representation of the data at each layer of a network can be thought of as lying on a manifold in the layer's activation space. Measures of the intrinsic dimensionality of these manifolds can provide insights into how networks lower the dimensionality of the problem layer by layer [14, 15]. However, such measurements have challenges. One key challenge is that they assume that real data lie on a smooth manifold, while in practice the dimensionality is estimated from a discrete set of points. This leads to counter-intuitive results such as an increase of the intrinsic dimensionality with depth, especially near the input, an effect that is impossible for continuous smooth manifolds. We resort to an example to illustrate how this increase with depth can result from spurious effects. Consider a manifold of a given intrinsic dimension that undergoes a transformation where one of the coordinates is multiplied by a large factor. This operation would result in an elongated manifold that appears one-dimensional. The measured intrinsic dimensionality would consequently be one, despite the higher dimensionality of the manifold. In the context of neural networks, a network that operates on such an elongated manifold could effectively 'reduce' this extra, spurious dimension. This could result in an increase of the observed intrinsic dimensionality as a function of network depth, even though the actual dimensionality of the manifold did not change.
In the specific case of our data, the intrinsic dimensionality of the internal representations of deep CNNs monotonically decreases with depth (see Fig. 14), consistently with the idea proposed in the main text that CNNs solve the problem by reducing the effective dimensionality of the data layer by layer. We attribute this monotonicity to the absence of spurious or noisy directions that might lead to the counter-intuitive effect described above.

Figure 14: Effective dimension of the internal representations of a CNN trained on one instance of the RHM with m = n_c = v, L = 3, resulting in P_max = 6 232. Left: average nearest-neighbor distance of the input or network activations when probing them with a dataset of size P. The value reported on the y-axis is normalized by δ_0 = δ(P = 10). The slope of δ(P) is used as an estimate of the effective dimension. Right: effective dimension as a function of depth. We observe a monotonic decrease, consistent with the idea that the dimensionality of the problem is reduced by DNNs with depth.