Hidden Unit Specialization in Layered Neural Networks: ReLU vs. Sigmoidal Activation

We study layered neural networks of rectified linear units (ReLU) in a modelling framework for stochastic training processes. The comparison with sigmoidal activation functions is at the center of interest. We compute typical learning curves for shallow networks with K hidden units in matching student teacher scenarios. The systems exhibit sudden changes of the generalization performance via the process of hidden unit specialization at critical sizes of the training set. Surprisingly, our results show that the training behavior of ReLU networks is qualitatively different from that of networks with sigmoidal activations. In networks with K ≥ 3 sigmoidal hidden units, the transition is discontinuous: Specialized network configurations co-exist and compete with states of poor performance even for very large training sets. On the contrary, the use of ReLU activations results in continuous transitions for all K: For large enough training sets, two competing, differently specialized states display similar generalization abilities, which coincide exactly for large networks in the limit K → ∞.


I. INTRODUCTION
The regained interest in artificial neural networks [1]-[5] is largely due to the successful application of so-called Deep Learning in a number of practical contexts, see e.g. [6]-[8] for reviews and further references.
The successful training of powerful, multi-layered deep networks has become feasible for a number of reasons including the automated acquisition of large amounts of training data in various domains, the use of modified and optimized architectures, e.g. convolutional networks for image processing, and the ever-increasing availability of computational power needed for the implementation of efficient training.
One particularly important modification of earlier models is the use of alternative activation functions [6], [9], [10]. Arguably, so-called rectified linear units (ReLU) constitute the most popular choice in Deep Neural Networks [6], [9]- [13]. Compared to more traditional activation functions, the simple ReLU and recently suggested modifications warrant computational ease and appear to speed up the training, see for instance [13]- [15]. The one-sided ReLU function is found to yield sparse activity in large networks, a feature which is frequently perceived as favorable and biologically plausible [6], [11], [16]. In addition, the problem of vanishing gradients, which arises when applying the chain rule in layered networks of sigmoidal units, is avoided [6]. Moreover, networks of rectified linear units have displayed favorable generalization behavior in several practical applications and benchmark tests, e.g. [9]- [13].
The aim of this work is to contribute to a better theoretical understanding of how the use of ReLU activations influences and potentially improves the training behavior of layered neural networks. We focus on the comparison with traditional sigmoidal functions and analyse non-trivial model situations.
To this end, we employ approaches from the statistical physics of learning, which have been applied earlier with great success in the context of neural networks and machine learning in general [1], [3], [17]- [21]. The statistical physics approach complements other theoretical frameworks in that it studies the typical behavior of large learning systems in model scenarios. As an important example, learning curves have been computed in a variety of settings, including on-line and offline supervised training of feedforward neural networks, see for instance [3], [17]- [27] and references therein. A topic of particular interest for this work is the analysis of phase transitions in learning processes, i.e. sudden changes of the expected performance with the training set size or other control parameters, see [19], [20], [27]- [32] for examples and further references.
Currently, the statistical physics of learning is being revisited extensively in order to investigate relevant phenomena in deep neural networks and other learning paradigms, see [33]- [40] for recent examples and further references.
In this work, we systematically study the training of layered networks in so-called student teacher settings, see e.g. [3], [17], [18], [20]. We consider idealized, yet non-trivial scenarios of matching student and teacher complexity. Our findings demonstrate that ReLU networks display training and generalization behavior which differs significantly from their counterparts composed of sigmoidal units. Both network types display sudden changes of their performance with the number of available examples. In statistical physics terminology, the systems undergo phase transitions at a critical training set size. The underlying process of hidden unit specialization and the existence of saddle points in the objective function have recently attracted attention also in the context of Deep Learning [34], [41], [42].
Before analysing ReLU networks, we confirm earlier theoretical results which indicate that the transition for large networks of sigmoidal units is discontinuous (first order): For small training sets, a poorly generalizing state is observed, in which all hidden units approximate the target to some extent and essentially perform the same task. At a critical size of the training set, a favorable configuration with specialized hidden units appears. However, a poorly performing state remains metastable and the specialization required for successful learning can delay the training process significantly [28]-[31].

Fig. 1. Left panel: Illustration of the network architecture with an N-dim. input layer, a set of adaptive weight vectors w_k with k = 1, ..., K (represented by solid lines) and total output σ given by the sum of hidden unit activations with fixed weights (dashed lines). Right panel: The considered activation functions: the sigmoidal g(x) = 1 + erf[x/√2] (solid line) and the ReLU activation g(x) = max{0, x} (dashed line).
In contrast we find that, surprisingly, the corresponding phase transition in ReLU networks is always continuous (second order). At the transition, the unspecialized state is replaced by two competing configurations with very similar generalization ability. In large networks, their performance is nearly identical and it coincides exactly in the limit K → ∞.
In the next section we detail the considered models and outline the theoretical approach. In Sec. III our results are presented and discussed. We conclude with a summary and outlook on future extensions of this work.

II. MODEL AND ANALYSIS
Here we introduce the modelling framework, i.e. the considered student teacher scenarios. Moreover, we outline their analysis by means of statistical physics methods and discuss the simplifying assumption of training at high (formal) temperatures.

A. Network architecture and activation functions
We consider feed-forward neural networks in which N input nodes represent feature vectors ξ ∈ R^N. A single layer of K hidden units is connected to the input through adaptive weights W = {w_k ∈ R^N}_{k=1}^K. The total real-valued output reads

σ(ξ) = (1/√K) Σ_{k=1}^K g(x_k)  with  x_k = w_k · ξ / √N.   (1)

The quantity x_k is referred to as the local potential of hidden unit k. The resulting activation is specified by the function g(x), and the hidden-to-output weights are fixed to 1/√K. Figure 1 (left panel) illustrates the network architecture.
This type of network has been termed the Soft Committee Machine (SCM) in the literature due to its vague similarity to the committee machine for binary classification, e.g. [3], [18], [20], [27], [39], [43], [44]. There, the discrete output is determined by the majority of threshold units in the hidden layer, while the SCM is suitable for regression tasks.
We will consider two popular types of transfer functions: a) Sigmoidal activation Frequently, S-shaped transfer functions g(x) have been employed, which increase monotonically from zero at large negative arguments and saturate at a finite maximum for x → ∞. Popular examples are based on tanh(x) or the sigmoid (1 + e^{−x})^{−1}, often with an additional threshold θ as in g(x − θ), or a steepness parameter controlling the magnitude of the derivative g′. We study the particular choice

g(x) = 1 + erf[x/√2]   (2)

with 0 ≤ g(x) ≤ 2, which is displayed in the right panel of Fig. 1. The relation to an integrated Gaussian facilitates significant mathematical ease, which has been exploited in numerous studies of machine learning models, e.g. [22]-[24]. Here, the function (2) serves as a generic example of a sigmoidal and its specific form is not expected to influence our findings crucially. As we argue below, the choice of limiting values 0 and 2 for small and large arguments, respectively, is also arbitrary and irrelevant for the qualitative results of our analyses.

b) Rectified Linear Unit (ReLU) activation
This particularly simple, piece-wise linear transfer function has attracted considerable attention in the context of multi-layered neural networks. It is given by

g(x) = max{0, x},   (3)

which is illustrated in Fig. 1 (right panel). In contrast to sigmoidal activations, the response of the unit is unbounded for x → ∞. The function (3) is obviously not differentiable at x = 0. Here, we can ignore this mathematical subtlety and remark that it is considered irrelevant in practice [6]. Note also that our theoretical investigation in Sec. II does not relate to a particular realization of gradient-based training.
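For concreteness, the network output of Eq. (1) with either activation can be sketched in a few lines of Python. This is an illustrative implementation, not part of the original analysis; the function and variable names are ours:

```python
import math

def g_sigmoidal(x):
    # Sigmoidal activation, Eq. (2): monotonic, saturating, 0 <= g(x) <= 2
    return 1.0 + math.erf(x / math.sqrt(2.0))

def g_relu(x):
    # ReLU activation, Eq. (3): piece-wise linear, unbounded for x -> infinity
    return max(0.0, x)

def scm_output(W, xi, g):
    # Soft Committee Machine output, Eq. (1):
    # sigma(xi) = (1/sqrt(K)) * sum_k g(x_k), with local potentials
    # x_k = w_k . xi / sqrt(N) and fixed hidden-to-output weights 1/sqrt(K).
    K, N = len(W), len(xi)
    potentials = [sum(w_kj * xi_j for w_kj, xi_j in zip(w_k, xi)) / math.sqrt(N)
                  for w_k in W]
    return sum(g(x) for x in potentials) / math.sqrt(K)
```

For the zero input ξ = 0 all local potentials vanish, so the sigmoidal network returns K · g(0)/√K = √K (e.g. 2 for K = 4), while the ReLU network returns 0, reflecting the different offsets of the two activations.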
It is important to realize that replacing the above functions by g(x) = γ 1 + erf[x/ √ 2] in (a) or by g(x) = max{0, γ x} = γ max{0, x} in (b), where γ > 0 is an arbitrary factor, would be equivalent to setting the hidden unit weights to γ/ √ K in Eq. (1). Alternatively, we could incorporate the factor γ in the effective temperature parameter α of the theoretical analysis in Sec. II-D. Apart from this trivial re-scaling, our results would not be affected qualitatively.

B. Student and teacher scenario
We investigate the training and generalization behavior of the layered networks introduced above in a setup that models the learning of a regression scheme from example data. Assume that a given training set D = {ξ^μ, τ^μ}_{μ=1}^P comprises P input-output pairs which reflect the target task. In order to facilitate successful learning, P should be proportional to the number of adaptive weights in the trained system. In our specific model scenario the labels τ^μ = τ(ξ^μ) are thought to be provided by a teacher SCM, representing the target input-output relation

τ(ξ) = (1/√M) Σ_{m=1}^M g(y_m)  with  y_m = w*_m · ξ / √N.   (4)

The response is specified in terms of the set of teacher weight vectors W* = {w*_m}_{m=1}^M and defines the correct target output for every possible feature vector ξ. For simplicity, we will focus on settings with orthonormal teacher weight vectors and restrict the adaptive student configuration to normalized weights:

w*_m · w*_n = N δ_mn   (5)   and   w_k · w_k = N for all k,   (6)

with the Kronecker delta δ_mn = 1 if m = n and δ_mn = 0 otherwise. Throughout the following, the evaluation of the student network is based on the simple quadratic error measure ε(ξ) = ½ [σ(ξ) − τ(ξ)]², which compares student output and target value. Accordingly, the selection of student weights W in the training process is guided by a cost or loss function given by the corresponding sum over all available data in D:

E = Σ_{μ=1}^P ε(ξ^μ).   (7)

By choosing the parameters K and M, a variety of situations can be modelled. This includes the learning of unrealizable rules (K < M) and the training of over-sophisticated students with K > M. Here, we restrict ourselves to the idealized, yet non-trivial case of perfectly matching student and teacher complexity, i.e. K = M, which makes it possible to achieve ε(ξ) = 0 for all input vectors.
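A minimal sketch of the teacher construction and the cost function can be given in Python. We assume here, for illustration only, that the orthonormal teacher vectors are realized as scaled canonical basis vectors (which requires M ≤ N); all names are ours:

```python
import math, random

def output(W, xi, g):
    # SCM output of Eq. (1): sigma = (1/sqrt(K)) sum_k g(w_k . xi / sqrt(N))
    N = len(xi)
    return sum(g(sum(w * x for w, x in zip(w_k, xi)) / math.sqrt(N))
               for w_k in W) / math.sqrt(len(W))

def make_teacher(M, N):
    # Orthonormal teacher weights, Eq. (5): w*_m . w*_n = N * delta_mn,
    # realized as sqrt(N) times canonical basis vectors (assumes M <= N).
    return [[math.sqrt(N) if j == m else 0.0 for j in range(N)] for m in range(M)]

def training_set(W_star, P, N, g, rng):
    # P input-output pairs; inputs have i.i.d. zero-mean, unit-variance components
    inputs = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(P)]
    return [(xi, output(W_star, xi, g)) for xi in inputs]

def cost(W, data, g):
    # Quadratic loss summed over the training set, Eq. (7)
    return sum(0.5 * (output(W, xi, g) - tau) ** 2 for xi, tau in data)
```

In the matching case K = M, a student identical to the teacher achieves E = 0 on any data set, which is exactly the realizability exploited in the analysis.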

C. Generalization error and order parameters
Throughout the following we consider feature vectors ξ^μ in the training set with uncorrelated i.i.d. random components of zero mean and unit variance. Likewise, arbitrary input vectors ξ ∉ D are assumed to follow the same statistics: ⟨ξ^μ_j⟩ = 0, ⟨ξ^μ_j ξ^ν_k⟩ = δ_jk δ_μν, ⟨ξ_j⟩ = 0 and ⟨ξ_j ξ_k⟩ = δ_jk. As a consequence of this assumption, the Central Limit Theorem applies to the local potentials x_i = w_i · ξ/√N and y_m = w*_m · ξ/√N, which become correlated Gaussian random variables of order O(1). It is straightforward to work out the characteristic averages and (co-)variances:

⟨x_i⟩ = ⟨y_m⟩ = 0,  ⟨x_i x_j⟩ = w_i · w_j / N = Q_ij,  ⟨x_i y_m⟩ = w_i · w*_m / N = R_im,  ⟨y_m y_n⟩ = δ_mn.   (8)

The so-called order parameters R_ij and Q_ij for i, j = 1, 2, ..., K serve as macroscopic characteristics of the student configuration. The norms Q_ii = 1 are fixed according to Eq. (6), while the symmetric Q_ij = Q_ji quantify the K(K−1)/2 pairwise alignments of student weight vectors. The similarity of the student weights to their counterparts in the teacher network is measured in terms of the K² quantities R_ij. Due to the assumed normalizations, the relations −1 ≤ Q_ij, R_ij ≤ 1 are obviously satisfied. Now we can work out the generalization error, i.e. the expected deviation of student and teacher output for a random input vector, given specific weight configurations W and W*. Note that SCM with g(x) = erf[x/√2] have been treated in [23], [24] for general K, M. Here, we resort to the special case of matching network sizes, K = M, with

ε_g = ⟨ε(ξ)⟩_ξ = ½ ⟨[σ(ξ) − τ(ξ)]²⟩_ξ.   (9)

We note here that matching additive constants in the student and teacher activations would leave ε_g unaltered. As detailed in the Appendix, all averages in Eq. (9) can be computed analytically for both choices of the activation function g(x) in student and teacher network. Eventually, the generalization error is expressed in terms of very few macroscopic order parameters, instead of explicitly taking into account KN individual weights. The concept is characteristic for the statistical physics approach to systems with many degrees of freedom.
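The role of the order parameters can be illustrated numerically. For a student whose weight vectors are exactly orthogonal to all teacher vectors and to each other (R_im = 0, Q_ij = δ_ij), a Monte-Carlo estimate of the generalization error should reproduce the value ε_g = 1/3 quoted below for the sigmoidal case. The following is our own sanity check, not a computation from the paper; all names are ours:

```python
import math, random

def overlaps(W, W_star, N):
    # Order parameters of Eq. (8): Q_ij = w_i.w_j / N, R_im = w_i.w*_m / N
    dot = lambda a, b: sum(x * y for x, y in zip(a, b)) / N
    Q = [[dot(wi, wj) for wj in W] for wi in W]
    R = [[dot(wi, wm) for wm in W_star] for wi in W]
    return Q, R

def mc_generalization_error(W, W_star, g, n_samples, rng):
    # Monte-Carlo estimate of eps_g = < (sigma - tau)^2 / 2 > over random inputs
    N = len(W[0])
    def out(V, xi):
        return sum(g(sum(v * x for v, x in zip(v_k, xi)) / math.sqrt(N))
                   for v_k in V) / math.sqrt(len(V))
    total = 0.0
    for _ in range(n_samples):
        xi = [rng.gauss(0.0, 1.0) for _ in range(N)]
        total += 0.5 * (out(W, xi) - out(W_star, xi)) ** 2
    return total / n_samples

# Teacher: sqrt(N) e_1..e_K; student: sqrt(N) e_{K+1}..e_{2K}  ->  R = 0, Q = identity
K, N = 3, 6
W_star = [[math.sqrt(N) if j == m else 0.0 for j in range(N)] for m in range(K)]
W = [[math.sqrt(N) if j == K + i else 0.0 for j in range(N)] for i in range(K)]
```

With g(x) = 1 + erf(x/√2), the estimate converges to 1/3 as the number of samples grows; note that the additive constant 1 in the activation cancels between student and teacher, as stated above.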
In the following, we restrict the analysis to student configurations which are site-symmetric with respect to the hidden units:

R_ij = R δ_ij + S (1 − δ_ij),  Q_ij = δ_ij + C (1 − δ_ij).   (10)

Obviously, the system is invariant under permutations of the hidden units, so we can restrict ourselves to the specific case in which matching indices i = j mark the specialization in Eq. (10). While this assumption reflects the symmetries of the student teacher scenario, it allows for the specialization of hidden units: For R = S, all student units display the same overlap with all teacher units. In specialized configurations with R ≠ S, however, each student weight vector has achieved a distinct overlap with exactly one of the teacher units. Our analysis shows that states with both positive (R > S) and negative specialization (R < S) can play a significant role in the training process.
Under the above assumption of site-symmetry (10) and applying the normalization (6), the generalization error (9), see also Eqs. (26,28), becomes

a) for g(x) = 1 + erf[x/√2] in student and teacher [23]:

ε_g = (1/π) [ π/3 + (K−1) arcsin(C/2) − 2 arcsin(R/2) − 2 (K−1) arcsin(S/2) ];   (11)

b) for g(x) = max{0, x} in student and teacher:

ε_g = 1/2 + ((K−1)/2) [ f(C) + 1/(2π) ] − f(R) − (K−1) f(S),  with  f(ρ) = ρ/4 + [ ρ arcsin(ρ) + √(1−ρ²) ] / (2π).   (12)

In both settings, perfect agreement of student and teacher with ε_g = 0 is achieved for C = S = 0 and R = 1. The scaling of outputs with hidden-to-output weights 1/√K in Eq. (1) results in a generalization error which is not explicitly K-dependent for uncorrelated random students: A configuration with R = C = S = 0 yields ε_g = 1/3 in the case of sigmoidal activations (a), whereas ε_g = 1/2 − 1/(2π) ≈ 0.341 for ReLU student and teacher.
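The limiting values quoted above can be checked directly by implementing the two site-symmetric expressions. The following sketch uses our own function names; the formulas are those of Eqs. (11) and (12):

```python
import math

def eps_g_erf(R, S, C, K):
    # Eq. (11): site-symmetric generalization error, sigmoidal activation
    return (math.pi / 3 + (K - 1) * math.asin(C / 2)
            - 2 * math.asin(R / 2)
            - 2 * (K - 1) * math.asin(S / 2)) / math.pi

def f(rho):
    # Two-point average <max(0,x) max(0,y)> for unit-variance Gaussians
    # with correlation rho, entering Eq. (12)
    return rho / 4 + (rho * math.asin(rho) + math.sqrt(1 - rho * rho)) / (2 * math.pi)

def eps_g_relu(R, S, C, K):
    # Eq. (12): site-symmetric generalization error, ReLU activation
    return (0.5 + 0.5 * (K - 1) * (f(C) + 1 / (2 * math.pi))
            - f(R) - (K - 1) * f(S))
```

Both functions vanish for the perfectly specialized configuration R = 1, S = C = 0, and for R = S = C = 0 they reproduce the K-independent values 1/3 and 1/2 − 1/(2π), respectively.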

D. Thermal equilibrium and the high-temperature limit
In order to analyse the expected outcome of training from a set of examples D, we follow the well-established statistical physics approach and analyse an ensemble of networks in a formal thermal equilibrium situation. In this framework, the cost function E is interpreted as the energy of the system and the density of observed network states is given by the so-called Gibbs-Boltzmann density

p(W) = exp[−β E(W)] / Z  with  Z = ∫ dµ(W) exp[−β E(W)],   (13)

where the measure dµ(W) incorporates potential restrictions of the integration over all possible configurations of W, for instance the normalization w_k² = N for all k. This equilibrium density would, for example, result from a Langevin type of training dynamics

∂W/∂t = −∇_W E(W) + η(t),   (14)

where ∇_W denotes the gradient with respect to all KN degrees of freedom in the student network. Here, the minimization of E is performed in the presence of a δ-correlated white noise term η(t) with ⟨η_i(t)⟩ = 0 and ⟨η_i(t) η_j(t′)⟩ = (2/β) δ_ij δ(t − t′), where δ(...) denotes the Dirac delta-function. The parameter β = 1/T controls the strength of the thermal noise in the gradient-based minimization of E. According to the, by now, standard statistical physics approach to off-line learning [1], [3], [17], [18], typical properties of the system are governed by the so-called quenched free energy

−β f = ⟨ln Z⟩_D / N,   (15)

where ⟨...⟩_D denotes the average over the random realization of the training set. In general, the evaluation of the quenched average ⟨ln Z⟩_D is technically involved and requires, for instance, the application of the replica trick [1], [3], [18].
Here, we resort to the simplifying limit of training at high temperature T → ∞, β → 0, which has proven useful in the qualitative investigation of various learning scenarios [17]. In the limit β → 0, the so-called annealed approximation [3], [17], [18] ⟨ln Z⟩_D ≈ ln ⟨Z⟩_D becomes exact. Moreover, we have

⟨Z⟩_D = ∫ dµ(W) ⟨exp[−β E]⟩_D ≈ ∫ dµ(W) exp[−β ⟨E⟩_D].

Here, P is the number of statistically independent examples in D and ⟨E⟩_D = P ⟨ε(ξ)⟩_ξ = P ε_g. As the exponent grows linearly with P ∝ N, the integral is dominated by the maximum of the integrand. By means of a saddle-point integration for N → ∞ we obtain

β f = min_{ {R_ij, Q_ij} } [ β P ε_g / N − s ].   (16)

Here, the right-hand side has to be minimized with respect to the arguments, i.e. the order parameters {R_ij, Q_ij}. In Eq. (16) we have introduced the entropy term

s = lim_{N→∞} (1/N) ln ∫ dµ(W) Π_{i≤j} δ(w_i · w_j − N Q_ij) Π_{i,m} δ(w_i · w*_m − N R_im).   (17)

The quantity e^{N s} corresponds to the volume in weight space that is consistent with a given configuration of order parameters. Independent of the activation functions or other details of the learning problem, one obtains for large N [30], [31]

s = ½ ln det C + const.,   (18)

where C is the (2K × 2K)-dimensional matrix of all pairwise and self-overlaps of the vectors {w_i}_{i=1}^K and {w*_m}_{m=1}^K. The constant term is independent of the order parameters and, hence, irrelevant for the minimization in Eq. (16). A compact derivation of (18) is provided in, e.g., [31].
Omitting additive constants and assuming the normalization (6) and site-symmetry (10), the entropy term reads [30], [31]

s(R, S, C) = ((K−1)/2) ln[ 1 − C − (R − S)² ] + ½ ln[ 1 + (K−1) C − (R + (K−1) S)² ].   (19)

In order to facilitate the successful adaptation of KN weights in the student network, we have to assume that the number of examples scales like P = α K N. Training at high temperature additionally requires that α̃ = α β = O(1) for α → ∞, β → 0, which yields a free energy of the form

β f(R, S, C) = α̃ K ε_g(R, S, C) − s(R, S, C).   (20)
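The site-symmetric entropy and the scaled free energy translate directly into code. In this sketch (our naming), ε_g is passed in as a function so that either activation can be used:

```python
import math

def entropy(R, S, C, K):
    # Site-symmetric entropy term, omitting the irrelevant additive constant:
    # s = (K-1)/2 * ln[1 - C - (R-S)^2] + 1/2 * ln[1 + (K-1)C - (R+(K-1)S)^2].
    # Non-positive log arguments signal an unphysical order-parameter set.
    a = 1 - C - (R - S) ** 2
    b = 1 + (K - 1) * C - (R + (K - 1) * S) ** 2
    if a <= 0 or b <= 0:
        return -math.inf
    return 0.5 * (K - 1) * math.log(a) + 0.5 * math.log(b)

def beta_f(R, S, C, K, alpha, eps_g):
    # Scaled free energy: beta*f = alpha~ * K * eps_g - s
    return alpha * K * eps_g(R, S, C, K) - entropy(R, S, C, K)
```

The entropy is maximal (zero) for the uncorrelated configuration R = S = C = 0 and diverges to −∞ at the boundary of the admissible region, while the energy term grows with α̃ and shifts the minima of βf toward configurations of low ε_g.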
The quantity α̃ = βP/(KN) can be interpreted as an effective temperature parameter or, likewise, as the properly scaled training set size; for brevity, we write α for α̃ in the following. The high temperature has to be compensated by a very large number of training examples in order to facilitate a non-trivial outcome. As a consequence, the energy of the system is proportional to ε_g, which implies that training error and generalization error are effectively identical in this simplifying limit.
III. RESULTS AND DISCUSSION

In the following, we present and discuss our findings for the considered student teacher scenarios and activation functions.
In order to obtain the equilibrium states of the model for given values of α and K, we have minimized the scaled free energy (20) with respect to the site-symmetric order parameters. Potential (local) minima satisfy the necessary conditions

∂(βf)/∂R = ∂(βf)/∂S = ∂(βf)/∂C = 0.   (21)

In addition, the corresponding Hesse matrix H of second derivatives w.r.t. R, S, and C has to be positive definite. This constitutes a sufficient condition for the presence of a local minimum in the site-symmetric order parameter space. Furthermore, we have confirmed the stability of the local minima against potential deviations from site-symmetry by inspecting the full matrix of second derivatives involving the complete set of order parameters {R_ij, Q_ij}.
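As an illustration of this procedure, the minimization can be mimicked by a coarse brute-force grid search over (R, S, C), here for the ReLU scenario with K = 2. This is our own numerical sketch, restating the site-symmetric generalization error, entropy, and free energy inline so that it is self-contained; the paper instead solves the exact stationarity conditions. Consistent with the critical value α_c(2) ≈ 6.1 reported below, the grid minimum is unspecialized (R = S) for small α and clearly specialized for large α:

```python
import math

def f(rho):
    # <max(0,x) max(0,y)> for correlated unit-variance Gaussians
    return rho / 4 + (rho * math.asin(rho) + math.sqrt(1 - rho * rho)) / (2 * math.pi)

def eps_g_relu(R, S, C, K):
    # Site-symmetric generalization error for ReLU student and teacher
    return 0.5 + 0.5 * (K - 1) * (f(C) + 1 / (2 * math.pi)) - f(R) - (K - 1) * f(S)

def beta_f(R, S, C, K, alpha):
    # Scaled free energy beta*f = alpha * K * eps_g - s, with the
    # site-symmetric entropy; unphysical configurations get beta*f = +inf.
    a = 1 - C - (R - S) ** 2
    b = 1 + (K - 1) * C - (R + (K - 1) * S) ** 2
    if a <= 1e-9 or b <= 1e-9:
        return math.inf
    s = 0.5 * (K - 1) * math.log(a) + 0.5 * math.log(b)
    return alpha * K * eps_g_relu(R, S, C, K) - s

def minimize_on_grid(K, alpha, step=0.05):
    # Brute-force search for the global minimum of beta*f on a coarse grid
    grid = [round(-0.95 + step * i, 4) for i in range(int(round(1.9 / step)) + 1)]
    best = min((beta_f(R, S, C, K, alpha), R, S, C)
               for R in grid for S in grid for C in grid)
    return best[1:]  # (R, S, C) at the minimum
```

Below the transition (e.g. α = 3) the minimizing configuration lies on the R = S diagonal up to grid resolution; well above it (e.g. α = 30) the minimizer is strongly specialized, |R − S| of order one.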

A. Sigmoidal units re-visited
The investigation of SCM with sigmoidal g(x) = erf[x/ √ 2] with −1 < g(x) < 1 along the lines of the previous section has already been presented in [30]. A corresponding model with discrete binary weights was studied in [28].
As argued above, for g(x) = (1 + erf[x/ √ 2]), the mathematical form of the generalization error, Eqs. (11,26), and the free energy (βf ) are the same as for the activation erf[x/ √ 2]. Hence, the results of [30] carry over without modification. The following summarizes the key findings of the previous study, which we reproduce here for comparison.
For K = 2 we observe that R = S in thermal equilibrium for small α, see the upper row of graphs in Fig. 2. Both hidden units perform essentially the same task and acquire equal overlap with both teacher vectors when trained from relatively small data sets. At a critical value α_c(2) ≈ 23.7, the system undergoes a transition to a specialized state with R > S or R < S, in which each hidden unit aligns with one specific teacher unit. Both configurations are fully equivalent due to the invariance of the student output under exchange of the student weights w_1 and w_2 for K = 2. The specialization process is continuous, with the quantity |R − S| increasing proportional to (α − α_c(K))^{1/2} near the transition. This results in a kink in the continuous learning curve ε_g(α) at α_c, as displayed in the upper right panel of Fig. 2.
Interestingly, a different behavior is found for all K ≥ 3. The following regimes can be distinguished:
(a) α < α_s(K): Only unspecialized configurations with R = S correspond to minima of the free energy. Within this subspace, a rapid initial decrease of ε_g with α is achieved.
(b) α_s(K) ≤ α < α_c(K): At α_s(K), a specialized configuration with R > S appears as a local minimum of the free energy. The R = S configuration corresponds to the global minimum up to α_c(K). At this K-dependent critical value, the free energies of the competing minima coincide.
(c) α > α_c(K): Above α_c, the configuration with R > S constitutes the global minimum of the free energy and, thus, the thermodynamically stable state of the system. Note that the transition from the unspecialized to the specialized configuration is associated with a discontinuous change of ε_g, cf. Fig. 2 (lower right panel). The (R > S) specialized state facilitates perfect generalization in the limit α → ∞.
(d) α ≥ α_d(K): In addition, at another characteristic value α_d, the (R = S) local minimum disappears and is replaced by a negatively specialized state with R < S. Note that the existence of this local minimum of the free energy was not reported in [30]. The observed specialization (S − R) increases linearly with (α − α_d) for α ≈ α_d. This smooth transition does not yield a kink in ε_g(α). A careful analysis of the associated Hesse matrix shows that the R < S state of poor generalization indeed persists for all α > α_d.
The limit K → ∞ with K ≪ N has also been considered in [30]: The discontinuous transition is found to occur at α_s(K → ∞) ≈ 60.99 and α_c(K → ∞) ≈ 69.09. Interestingly, the characteristic value α_d diverges as α_d(K) = 4πK for large K [30]. Hence, the additional transition from R = S to R < S cannot be observed for data sets of size P ∝ KN. On this scale, the unspecialized configuration persists for α → ∞. It displays site-symmetric order parameters R = S = O(1/K) with R, S > 0 and C = O(1/K²), see [30] for details. Asymptotically, for α → ∞, they approach the values R = S = 1/K and C = 0, which yields the non-zero generalization error ε_g(α → ∞) = 1/3 − 1/π ≈ 0.0150. On the contrary, the R > S specialized configuration achieves ε_g → 0, i.e. perfect generalization, asymptotically.

Fig. 4. ReLU activation: Learning curves of the perfectly matching student teacher scenario for K → ∞. In this limit, the continuous transition occurs at α_c = 2π. In the left panel, the solid line represents the specialized solution with R(α) > 0, while the chain line marks the solution with R(α) < 0. In the former, S → 0 for large α, while in the latter, S remains positive with S = O(1/K) for large K. The learning curves ε_g(α) for the competing minima of βf coincide for K → ∞, as displayed in the right panel. They approach perfect generalization, i.e. ε_g → 0 for α → ∞.
The presence of a discontinuous specialization process for sigmoidal activations with K ≥ 3 suggests that -in practical training situations -the network will very likely be trapped in an unfavorable configuration unless prior knowledge about the target is available. The escape from the poorly generalizing metastable state with R = S or R < S requires considerable effort in high-dimensional weight space. Therefore, the success of training will be delayed significantly.

B. Rectified linear units
In comparison with the previously studied case of sigmoidal activations, we find a surprisingly different behavior in ReLU networks with K ≥ 3.
For K = 2, our findings parallel the results for networks with sigmoidal units: The network configuration is characterized by R = S for α < α_c(K), and the specialization increases like (α − α_c(K))^{1/2} near the transition. This results in a kink in the learning curve ε_g(α) at α = α_c(K), as displayed in Fig. 3 (upper row) for K = 2 with α_c(2) ≈ 6.1. However, in ReLU networks the transition is also continuous for K ≥ 3. Figure 3 (lower row of graphs) displays the results for the example case K = 10 with α_c(10) ≈ 6.2.
The student output is invariant under exchange of the hidden unit weight vectors, consistent with an R = S unspecialized state for small α. At a critical value α_c(K), the unspecialized (R = S) configuration is replaced by two minima of βf: in the global minimum we have R > S, while the competing local minimum corresponds to configurations with R < S.

Fig. 5 (caption fragment). Right panel: In the ReLU system with, e.g., K = 10, C becomes positive before the continuous transition occurs; it reaches a maximum in α_c and approaches zero from above for α → ∞ in the specialized configuration with R > S (solid line). In the local minimum of βf with R < S, C becomes negative for large α, as marked by the dotted line.
In contrast to the case of sigmoidal activation, both competing configurations of the ReLU system display very similar generalization behavior. While, in general, only states with R > 0 can perfectly reproduce the teacher output, the student configurations with S > 0 and R < 0 also achieve relatively low generalization error for large α, see Fig. 3 (lower row) for an example.
The limiting case of large networks with K → ∞ can be considered explicitly. We find for large ReLU networks that the continuous specialization transition occurs at α c (K → ∞) = 2π ≈ 6.28.
In the configuration with R < 0, the order parameters display the scaling behavior R → −1 and S = O(1/K) for large K. In Appendix D we show how a single teacher ReLU with activation max(0, x*) can be approximated by (K − 1) weakly aligned units in combination with one anticorrelated student node. While the former effectively approximate a linear response of the form const. + x*, the unit with R = −1 implements max(0, −x*). Since max(0, x*) = max(0, −x*) + x*, the student can approximate the teacher output very well, see also the appendix for details. In the limit K → ∞, the correspondence becomes exact and facilitates perfect generalization for α → ∞.
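The mechanism rests on the elementary identity max(0, x) = max(0, −x) + x: for x ≥ 0 the right-hand side is 0 + x, for x < 0 it is −x + x = 0. A two-line numerical check (our own illustration):

```python
import random

def relu(x):
    # ReLU activation of Eq. (3)
    return max(0.0, x)

# An anticorrelated student unit (local potential -x*) plus an effectively
# linear contribution x* reproduces a teacher ReLU with potential x* exactly:
# max(0, x*) = max(0, -x*) + x* for every real x*.
rng = random.Random(42)
samples = [rng.uniform(-5.0, 5.0) for _ in range(1000)]
deviations = [abs(relu(x) - (relu(-x) + x)) for x in samples]
```

The deviation vanishes identically, which is the exactness underlying the negatively specialized solution; a saturating sigmoidal activation admits no analogous decomposition.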
Note that a similar argument does not hold for student teacher scenarios with sigmoidal activation functions which do not display the partial linearity of the ReLU.

C. Student-student overlaps
It is also instructive to inspect the behavior of the order parameter C, which quantifies the mutual overlap of student weight vectors. In the ReLU system with large finite K, we observe C(α) = O(1/K²) > 0 before the transition. It reaches a maximum value at the phase transition and decreases with increasing α > α_c. In the positively specialized configuration it approaches the limiting value C(α → ∞) = 0 from above, while it assumes negative values of order O(1/K²) in the configuration with R < S. This is in contrast to networks of sigmoidal units, where C < 0 before the discontinuous transition and in the specialized (R > S) state, see [30], [31] for details. Interestingly, the characteristic value α_d coincides with the point where C becomes positive in the suboptimal local minimum of βf. Figure 5 displays C(α) for sigmoidal (left panel) and ReLU activation (right panel) for K = 5 as an example. Apparently, the ReLU system tends to favor correlated hidden units during most of the training process.

D. Practical relevance
It is important to realize that a quantitative comparison of the two scenarios, for instance w.r.t. the critical values α c , is not sensible. The complexities of sigmoidal and ReLU networks with K units do not necessarily correspond to each other. Moreover, the actual α-scale is trivially related to a potential scaling of the activation functions.
However, our results provide valuable qualitative insight: The continuous nature of the transition suggests that ReLU systems should display favorable training behavior in comparison to systems of sigmoidal units. In particular, the suboptimal competing state displays very good performance, comparable to that of the properly specialized configuration. Their generalization abilities even coincide in large networks of many hidden units.
On the contrary, the achievement of good generalization in networks of sigmoidal units will be delayed significantly due to the discontinuous specialization transition which involves a poorly generalizing metastable state.

IV. CONCLUSION AND OUTLOOK
We have investigated the training of shallow, layered neural networks in student teacher scenarios of matching complexity. Large, adaptive networks have been studied by employing modelling concepts and analytical tools borrowed from the statistical physics of learning. Specifically, stochastic training processes at high formal temperature were studied and learning curves were obtained for two popular types of hidden unit activation.
To the best of our knowledge, this work constitutes the first theoretical, model-based comparison of sigmoidal hidden unit activations and rectified linear units in feed-forward neural networks.
Our results confirm that networks with K ≥ 3 sigmoidal hidden units undergo a discontinuous transition: A critical training set size is required to facilitate the differentiation, i.e. specialization of hidden units. However, a poorly performing state of the network persists as a locally stable configuration for all sizes of the training set. The presence of such an unfavorable local minimum will delay successful learning in practice, unless prior knowledge of the target rule allows for non-zero initial specialization.
On the contrary, the specialization transition is always continuous in ReLU networks. We show that above a weakly K-dependent critical value of the re-scaled training set size α, two competing specialized configurations exist. Only one of them displays positive specialization R > S and facilitates perfect generalization from large training sets for finite K. However, the competing configuration with negative specialization R < 0, S > 0 realizes similar performance, which is nearly identical for networks with many hidden units and coincides exactly in the limit K → ∞.
As a consequence, the problem of retarded learning associated with the existence of metastable configurations is expected to be much less pronounced in ReLU networks than in their counterparts with sigmoidal activation.
Clearly, our approach is subject to several limitations which will be addressed in future studies.
Probably the most straightforward, relevant extension of our work would be the consideration of further activation functions, for instance modifications of the ReLU such as the leaky or noisy ReLU or alternatives like swish and max-out [9], [10].
Within the site-symmetric space of configurations, cf. Eq. (10), only the specialization of single units with respect to one of the teacher units can be considered. In large networks, one would expect partially specialized states, where subsets of hidden units achieve different alignment with specific teacher units. Their study requires the extension of the analysis beyond the assumption of site-symmetry.
Training at low formal temperatures can be studied along the lines of [31], where the replica formalism was already applied to networks with sigmoidal activation. Alternatively, the simpler annealed approximation could be used [3], [17], [18]. Both approaches make it possible to vary the control parameter β of the training process and the scaled example set size α = P/(KN) independently, as is the case in more realistic settings. Note that the findings reported in [31] for sigmoidal activation displayed excellent qualitative agreement with the results of the much simpler high-temperature analysis in [30].
The dynamics of non-equilibrium on-line training by gradient descent has been studied extensively for soft committee machines with sigmoidal activation, e.g. [23]-[26]. There, quasi-stationary plateau states in the learning dynamics are the counterparts of the phase transitions observed in thermal equilibrium situations. First results for ReLU networks have been obtained recently [45]. These studies should be extended in order to identify and understand the influence of the activation function on the training dynamics in greater detail.
Model scenarios with mismatched student and teacher complexity will provide further insight into the role of the activation function for the learnability of a given task. It should be interesting to investigate specialization transitions in practically relevant settings in which either the task is unlearnable (K < M ) or the student architecture is over-sophisticated for the problem at hand (K > M ). In addition, student and teacher systems with mismatched activation functions should constitute interesting model systems.
The complexity of the considered networks can be increased in various directions. If the simple shallow architecture of Eq. (1) is extended by local thresholds and hidden to output weights that are both adaptive, it parameterizes a universal approximator, see e.g. [46]- [48]. Decoupling the selection of these few additional parameters from the training of the input to hidden weights should be possible following the ideas presented in [49].
Ultimately, deep layered architectures should be investigated along the same lines. As a starting point, simplifying tree-like architectures could be considered as in e.g. [27], [39].
Our modelling approach and theoretical analysis go beyond the empirical investigation of data-set-specific performance. The suggested extensions promise to contribute to a better, fundamental understanding of layered neural networks and their training behavior.

APPENDIX
A. Co-variance matrix and order parameters

The (K+M)×(K+M)-dim. matrix of order parameters reads

C = ( Q   R )
    ( Rᵀ  T )    (24)

with Q_ij = w_i · w_j/N, R_in = w_i · w_n*/N, and T_nm = w_n* · w_m*/N for i, j ∈ {1, . . . , K} and n, m ∈ {1, . . . , M}. Note that Eqs. (18) and (19) correspond to the special case of K = M and exploit site-symmetry (10) and normalization (6).
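As an illustration, the matrix of order parameters can be assembled numerically from explicit student and teacher weight vectors; the following minimal sketch (variable names are ours, not from the text) enforces the normalization Q_ii = T_nn = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 1000, 3, 3

# student weights w_i and teacher weights w_n*, normalized so that w.w/N = 1
W = rng.standard_normal((K, N))
W /= np.linalg.norm(W, axis=1, keepdims=True) / np.sqrt(N)
Wstar = rng.standard_normal((M, N))
Wstar /= np.linalg.norm(Wstar, axis=1, keepdims=True) / np.sqrt(N)

Q = W @ W.T / N          # student-student overlaps Q_ij
R = W @ Wstar.T / N      # student-teacher overlaps R_in
T = Wstar @ Wstar.T / N  # teacher-teacher overlaps T_nm

# (K+M) x (K+M) covariance matrix of the hidden unit local potentials
C = np.block([[Q, R], [R.T, T]])
```

By construction C is symmetric with unit diagonal, as required for the joint Gaussian density of the local potentials.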

B. Derivation of the generalization error
Here we give a derivation of the generalization error in terms of the order parameters for sigmoidal and ReLU student and teacher. For general K and M it reads

ε_g = ½ ⟨ [ 1/√K Σ_{i=1}^K g(x_i) − 1/√M Σ_{n=1}^M g(x_n*) ]² ⟩,

which reduces to Eq. (9) for K = M. To obtain ε_g for a particular choice of activation function g, expectation values of the form ⟨g(x)g(y)⟩ have to be evaluated over the joint normal density of the hidden unit local potentials x and y, i.e. P(x, y) = N(0, Ĉ) with the appropriate 2×2 submatrix Ĉ of C, cf. Eq. (24).

1) Sigmoidal: For student and teacher with sigmoidal activation functions g(x) = erf[x/√2] or g(x) = ½(1 + erf[x/√2]), the required expectation has been derived in [24]:

⟨g(x)g(y)⟩ = (2/π) arcsin( Ĉ_xy / √[(1 + Ĉ_xx)(1 + Ĉ_yy)] )

for g(x) = erf[x/√2]; the result for the shifted version follows by the linear transformation g → (1 + g)/2.

2) ReLU: For student and teacher with ReLU activations g(x) = max{0, x}, applying the elegant formulation used in [50] gives an analytic expression for the two-dimensional integrals:

⟨g(x)g(y)⟩ = ⟨max{0, x} max{0, y}⟩ = (1/2π) [ √(Ĉ_xx Ĉ_yy − Ĉ_xy²) + Ĉ_xy (π − arccos ρ) ]  with  ρ = Ĉ_xy / √(Ĉ_xx Ĉ_yy).

For K = M, orthonormal teacher vectors with T_ij = δ_ij, fixed student norms Q_ii = 1, and assuming site symmetry, Eq. (10), we obtain Eqs. (11) and (12), respectively.
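The two-dimensional ReLU integral has the closed form ⟨max{0,x} max{0,y}⟩ = (1/2π)[√(Ĉ_xx Ĉ_yy − Ĉ_xy²) + Ĉ_xy(π − arccos ρ)] with ρ = Ĉ_xy/√(Ĉ_xx Ĉ_yy), cf. [50]. A hedged numerical sketch (our own check, not part of the original analysis) confirms it against Monte Carlo sampling:

```python
import numpy as np

def relu_expectation(cxx, cyy, cxy):
    """Analytic <max(0,x) max(0,y)> for (x, y) ~ N(0, C),
    following the arc-cosine kernel result used in the text."""
    rho = cxy / np.sqrt(cxx * cyy)
    theta = np.arccos(np.clip(rho, -1.0, 1.0))
    return (np.sqrt(cxx * cyy - cxy**2) + cxy * (np.pi - theta)) / (2 * np.pi)

# Monte Carlo check for a correlated pair of local potentials
rng = np.random.default_rng(1)
C = np.array([[1.0, 0.4], [0.4, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], C, size=2_000_000)
mc = np.mean(np.maximum(0.0, xy[:, 0]) * np.maximum(0.0, xy[:, 1]))
print(relu_expectation(1.0, 1.0, 0.4), mc)  # agree to a few 1e-3
```

The limiting cases serve as sanity checks: for fully correlated unit-variance potentials the expectation is ⟨max(0,x)²⟩ = 1/2, and for independent ones it factorizes to (1/√2π)² = 1/(2π).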

C. Single unit student and teacher
In the simple case K = 1 with a single unit as student and teacher network, we have to consider only one order parameter R = w · w*/N. Assuming w · w/N = w* · w*/N = 1, we obtain the free energy (βf) = α ε_g − s with s = ½ ln[1 − R²] + const.
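For the single sigmoidal unit g(x) = erf[x/√2] with unit norms, the expectation identity given in Appendix B yields ε_g(R) = (2/π)[arcsin(1/2) − arcsin(R/2)]. A minimal sketch (our own illustration, under these assumptions) minimizes the high-temperature free energy over R on a grid, reproducing the expected monotone growth of the overlap with α:

```python
import numpy as np

def eps_g(R):
    # generalization error of a single erf-unit student vs. a matched teacher
    return (2 / np.pi) * (np.arcsin(0.5) - np.arcsin(R / 2))

def beta_f(R, alpha):
    # high-temperature free energy (beta f) = alpha * eps_g - s,
    # with entropy s = 0.5 * ln(1 - R^2) + const
    s = 0.5 * np.log(1 - R**2)
    return alpha * eps_g(R) - s

R_grid = np.linspace(-0.999, 0.999, 4001)
for alpha in (1.0, 5.0, 20.0):
    R_opt = R_grid[np.argmin(beta_f(R_grid, alpha))]
    print(f"alpha = {alpha:5.1f}:  R = {R_opt:.3f}")
```

The entropic term −s diverges as R → ±1, so perfect alignment is approached only gradually as the scaled training set size α increases.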

D. Weak and negative alignment
Here we consider a particular teacher unit which realizes a ReLU response max(0, x*) with x* = w* · ξ/√N. A set of K hidden units in the student network can obviously reproduce the response by aligning one of the units perfectly with, e.g., R = w_1 · w*/N = 1 and S = w_j · w*/N = 0 for j > 1. Similarly, we obtain for R = −1 that x_1 = −w* · ξ/√N and max(0, x_1) = max(0, −x*). Now consider the mean response of a student unit with small positive overlap S = w_j · w*/N, given the teacher unit response x*. It corresponds to the average ⟨g(x_j)⟩_{x*} over the conditional density P(x_j | x*) = P(x_j, x*)/P(x*). By means of a Taylor expansion for S ≈ 0 one obtains ⟨g(x_j)⟩_{x*} = 1/√(2π) + S x*/2 + O(S²). As a special case, the mean response of an orthogonal unit with S = 0 is 1/√(2π), independent of x*.
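The Taylor expansion can be verified numerically: for unit-variance potentials with covariance S, the conditional law is x_j | x* ~ N(S x*, 1 − S²). A short sketch (our own check, with illustrative values of S and x*) compares sampling against the first-order expression:

```python
import numpy as np

rng = np.random.default_rng(2)
S, x_star = 0.05, 1.3   # small positive overlap and a fixed teacher potential

# sample from the conditional density x_j | x* ~ N(S * x*, 1 - S^2)
x_j = S * x_star + np.sqrt(1 - S**2) * rng.standard_normal(1_000_000)
mc = np.maximum(0.0, x_j).mean()

# first-order Taylor expansion of <max(0, x_j)>_{x*} in S
approx = 1 / np.sqrt(2 * np.pi) + S * x_star / 2
print(mc, approx)  # agree up to O(S^2) corrections
```

Setting S = 0 recovers the S-independent mean response 1/√(2π) of an orthogonal unit.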
It is straightforward to work out the conditional average of the total student response for a particular order parameter configuration with R = −1 and S = 2/(K − 1). Apart from the prefactor 1/√K it is given by max(−x*, 0) + x* + (K − 1)/√(2π) = max(0, x*) + (K − 1)/√(2π), where the right hand side coincides with the expected output for R = 1 and S = 0. Hence, the average response agrees with that of the positively specialized configuration for large K. Moreover, since S = 2/(K − 1) → 0, the neglected O(S²) contributions vanish and the correspondence becomes exact in the limit K → ∞, which facilitates perfect generalization in the negatively specialized state with S > 0, R < 0 discussed in Sec. III.
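The identity underlying this argument, max(0, −x*) + x* = max(0, x*), combined with the choice S = 2/(K − 1), can be checked numerically; the sketch below (our own variable names) compares the two linearized mean responses over a range of teacher potentials:

```python
import numpy as np

K = 50
x_star = np.linspace(-3.0, 3.0, 601)  # range of teacher local potentials
S = 2 / (K - 1)

# conditional mean of the total student response (without the 1/sqrt(K)
# prefactor), using <max(0, x_j)>_{x*} ~ 1/sqrt(2*pi) + S*x*/2 for the
# K-1 weakly aligned units
pos = np.maximum(0.0, x_star) + (K - 1) / np.sqrt(2 * np.pi)          # R = 1, S = 0
neg = np.maximum(0.0, -x_star) + (K - 1) * (1 / np.sqrt(2 * np.pi)
                                            + S * x_star / 2)          # R = -1
print(np.max(np.abs(pos - neg)))  # zero up to floating-point error
```

Within the linearized treatment, the two differently specialized configurations produce identical conditional mean responses; only the neglected O(S²) terms distinguish them at finite K.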