Statistical mechanical analysis of learning dynamics of two-layer perceptron with multiple output units

The plateau phenomenon, wherein the loss value stops decreasing during the process of learning, has long been troubling. Various studies suggest that it is frequently caused by the network being trapped in a singular region of the loss surface, a region that stems from the symmetrical structure of neural networks. However, these studies all deal with networks that have a one-dimensional output; networks with a multidimensional output have been overlooked. This paper uses a statistical mechanical formalization to analyze the dynamics of learning in a two-layer perceptron with multidimensional output. We derive order parameters that capture macroscopic characteristics of the connection weights, together with the differential equations that they obey. We show, in a simple setting, that singular-region-driven plateaus diminish or vanish with multidimensional output. We find that the more non-degenerate (i.e. the further from one-dimensional output) the model is, the more the plateaus are alleviated. Furthermore, we show theoretically that singular-region-driven plateaus seldom occur during learning in the case of orthogonalized initializations.


Introduction
Since deep learning was proposed in the late 2000s, neural networks have received much attention, because they enable us to solve real-world tasks in various fields, including image recognition, speech recognition and machine translation, with performance far exceeding that of conventional methods. However, some problems with neural networks remain unsolved. One of them is the 'plateau phenomenon', the main topic of this study, which we describe in detail below.
The perceptron is one representative machine learning method. Although it was first proposed in 1958 [1], there was no efficient learning algorithm at that time, and it became obsolete. In 1985, with the discovery of the backpropagation method [2], which is the fundamental learning algorithm of neural networks, neural networks began to gain attention once again. However, another problem arose: the 'plateau phenomenon', wherein learning slows down partway through. In the learning process of neural networks, the weight parameters of the network are updated iteratively so that the loss (the gap between the network's output and the desired output) decreases. Typically, however, the loss does not decrease monotonically at a steady rate; its rate of decrease slows down significantly partway through learning, and then speeds up again after a long period of time (see the black line in figure 2(a) and figure S1 in the supplemental material (stacks.iop.org/JPhysA/52/184002/mmedia) for examples). This is what we call the 'plateau phenomenon'. The phenomenon is observed ubiquitously in the learning of hierarchical models, including neural networks, radial basis function networks and mixtures-of-experts models [3][4][5][6][7][8][9][10]. However, although many researchers and engineers train hierarchical neural networks these days, the plateau phenomenon is rarely observed in practical applications of deep learning. Why is that?
With regard to theoretical studies of learning dynamics, the dynamics of linear neural networks has been studied analytically [11][12][13]. However, the plateau phenomenon is specific to nonlinear neural networks, which have a nonlinear activation function. Although studying the learning dynamics of nonlinear neural networks is challenging, the underlying mechanism of the plateau phenomenon has been widely studied. It is known empirically that neural networks get trapped on a plateau because they have symmetrical weights such that their input-output relationship is invariant under swapping two units of a hidden layer. The learning dynamics of neural networks have been studied in various settings. Some past studies, using a statistical mechanical formalization, derived the learning dynamics of a two-layer soft committee machine [4] and a two-layer perceptron [3]; this research showed that the weight symmetry results in saddle points that cause plateauing. Furthermore, it has been recognized that such a plateau phenomenon stems from a singular structure in the parameter space (see [9,[14][15][16], or [17] for a review). In the parameter space, a Riemannian metric is naturally induced by the Fisher information matrix, which represents how two models identified by slightly different parameters differ as statistical models [18]. This metric is not necessarily regular; there are regions in the parameter space in which the metric degenerates, called singular regions [14]. More specifically, consider a two-layer neural network: if it has two identical weight vectors projecting from the input layer to two different hidden units, its input-output relationship can be realized by another model that has one fewer hidden unit than the original model. If this downsized model gives a local minimum of the loss in the downsized parameter space, the original model parameter is in the singular region (see the section 'Singular regions and plateaus' for a detailed explanation).
The gradient of the loss is zero everywhere within the singular region. Although an isolated saddle point has the same property, the singular region differs from an isolated saddle point in two ways: it has one or more dimensions, and it is a Milnor-like attractor [19], that is, its region of attraction has nonzero measure (see figure 5 in [14] for a schematic diagram of the singular region). However, all of this research assumes one-dimensional output; in other words, networks that have multiple output units, which are the usual situation in modern deep learning, are overlooked, and they may avoid the plateau phenomenon.
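The degeneracy of this metric can be illustrated numerically. The sketch below is ours, not from the papers cited: it Monte-Carlo-estimates the Fisher information F = ⟨∇_θ s ∇_θ sᵀ⟩ of a K = 2, O = 1 network under Gaussian inputs and unit output noise, and checks that F loses rank when J_1 = J_2, since then ∂s/∂w_1 ≡ ∂s/∂w_2.

```python
import numpy as np

def g(x):  return np.tanh(x)            # odd sigmoid stand-in for erf(x/sqrt(2))
def dg(x): return 1.0 - np.tanh(x) ** 2

def fisher_mc(J, w, n_samples=500, seed=0):
    """Monte-Carlo Fisher information F = <grad grad^T> of the K=2, O=1 model
    s = w1 g(J1.xi) + w2 g(J2.xi), in the parameters (w1, w2, J1, J2)."""
    rng = np.random.default_rng(seed)
    N = J.shape[1]
    F = np.zeros((2 + 2 * N, 2 + 2 * N))
    for _ in range(n_samples):
        xi = rng.normal(size=N)
        x = J @ xi                       # pre-activations of the hidden units
        # gradient of the scalar output s with respect to (w1, w2, J1, J2)
        grad = np.concatenate([g(x), w[0] * dg(x[0]) * xi, w[1] * dg(x[1]) * xi])
        F += np.outer(grad, grad)
    return F / n_samples
```

With J_1 = J_2 the two columns of the gradient associated with w_1 and w_2 are identical for every input, so the estimated metric is exactly rank-deficient, independently of the sampling noise.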
For these reasons, we analyze the learning dynamics of a two-layer nonlinear perceptron that has multiple output units, with a statistical mechanical formulation. First, we introduce the student-teacher learning setting for ease of analysis [4,20]. Second, we introduce several order parameters, which capture macroscopic characteristics of the network weights, and derive their evolution equations from those of the microscopic variables (i.e. the network weights). We solve the derived equations numerically to obtain the macroscopic learning dynamics, with which we can discuss the plateau phenomenon quantitatively. Under a simple setting, we show that singular-region-driven plateaus diminish or vanish with multidimensional output. We find that the less degenerate (i.e. the further from one-dimensional output) the model is, the less it approaches the singular region, and the shorter the plateau becomes. Further, we show theoretically that singular-region-driven plateaus seldom occur in the learning process if the student and teacher models are initialized orthogonally.
Note that the claim that having multiple output units might alleviate the plateau phenomenon was hypothesized from intuitive insight in our previous work [21]; in the current work we examine this hypothesis through experiments and theoretical analyses, and show that multiple output units can indeed prevent the network from approaching the singular region and can eliminate plateaus.

Model
We considered a neural network with an input layer (size N), a hidden layer (size K) and an output layer (size O). The network receives input data ξ ∈ R^N and calculates the output

s = Σ_{i=1}^{K} w_i g(J_i · ξ) ∈ R^O,

where J_i ∈ R^N is the weight vector projecting the input to the ith hidden unit, w_i ∈ R^O is the weight vector projecting the ith hidden unit to the output layer, and g is the activation function. The parameters J and w are optimized during learning depending on the difference between the output s and the teacher data t. We considered an ideal situation in which the teacher data t is determined as t = Σ_{n=1}^{M} v_n g(B_n · ξ) ∈ R^O; in other words, the learning network (the 'student network') learns the input-output relationship of the 'teacher network', which is also a two-layer network, with N-M-O units and fixed weights B and v (figure 1(a)). We assumed the squared loss function ε = ½‖s − t‖², which is most commonly used for regression.
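For concreteness, the student and teacher maps and the squared loss can be sketched as follows; this is our illustration, not the paper's code, using the erf(x/√2) activation adopted later in the paper:

```python
import numpy as np
from math import erf, sqrt

# Sigmoid-like activation g(x) = erf(x / sqrt(2)), as used in the paper.
_erf = np.vectorize(lambda u: erf(u / sqrt(2)))

def g(x):
    return _erf(np.asarray(x, dtype=float))

def forward(J, w, xi):
    """Two-layer map s = sum_i w_i g(J_i . xi).
    J: (K, N) first-layer weight rows J_i; w: (K, O) second-layer rows w_i."""
    return g(J @ xi) @ w                  # output, shape (O,)

def loss(J, w, B, v, xi):
    """Squared loss eps = 0.5 * ||s - t||^2 for one input sample."""
    s = forward(J, w, xi)                 # student output
    t = forward(B, v, xi)                 # teacher output
    return 0.5 * float(np.sum((s - t) ** 2))
```

When the student's weights coincide with the teacher's, the loss is identically zero, which is the target of learning in this setting.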
For the statistical mechanical formulation of online learning, we introduced further idealization. We assumed that the dimension N of the input data is very large and that each element of the input data ξ is generated in accordance with an i.i.d. normal distribution, N(ξ_i | 0, 1). We put x_i := J_i · ξ and y_n := B_n · ξ, and define Q_ij := J_i · J_j = ⟨x_i x_j⟩, R_in := J_i · B_n = ⟨x_i y_n⟩, T_nm := B_n · B_m = ⟨y_n y_m⟩, D_ij := w_i · w_j, E_in := w_i · v_n and F_nm := v_n · v_m. The parameters Q_ij, R_in, T_nm, D_ij, E_in and F_nm defined above capture the state of the system macroscopically; they are therefore called 'order parameters'. The first three represent the state of the first layers of the two networks (student and teacher), and the latter three represent the state of their second layers (figure 1(b)). Roughly speaking, Q represents the norm of the student's first layer and T that of the teacher's first layer; R is related to the similarity between the first layers of the student and the teacher; and D, E and F are the second-layer counterparts of Q, R and T. Among these six order parameter matrices, the values of Q_ij, R_in, D_ij and E_in change during learning; their dynamics are what is to be determined, from the dynamics of the microscopic variables, i.e. the connection weights.
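The order parameters can be read off directly from the weight matrices as row-wise inner products. The following sketch (our illustration; the function name is ours) computes all six matrices and checks, by Monte Carlo, the identity Q_ij = ⟨x_i x_j⟩ stated above:

```python
import numpy as np

def order_parameters(J, w, B, v):
    """Order parameters as row-wise inner products of the weight matrices.
    J: (K, N), B: (M, N) first layers; w: (K, O), v: (M, O) second layers."""
    return {"Q": J @ J.T, "R": J @ B.T, "T": B @ B.T,
            "D": w @ w.T, "E": w @ v.T, "F": v @ v.T}
```

Because ξ has i.i.d. standard normal entries, the empirical second moment of x = Jξ converges to Q as the number of samples grows.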

Dynamics of order parameters
In this paper, we adopt the stochastic gradient descent (SGD) learning algorithm, which underlies virtually all of the algorithms used to train neural networks in practice. That is, every time an input sample is given, the output and the sample loss are computed, the sample loss is differentiated with respect to the model parameters, and the parameters are moved against the gradient of the loss. The update rule of the connection weights with SGD is written as

ΔJ_i = (η/N) δ_i ξ,  Δw_i = (η/N) g(x_i) (t − s),  where δ_i := g'(x_i) w_i · (t − s),    (1)

in which we set the learning rate to η/N, in order to obtain an N-independent macroscopic system. The first equation of (1) gives the update rule of the order parameters Q_ij and R_in in the form of difference equations:

ΔQ_ij = (η/N)(δ_i x_j + δ_j x_i) + (η/N)² δ_i δ_j ‖ξ‖²,  ΔR_in = (η/N) δ_i y_n.

Since ‖ξ‖² ≈ N and the right-hand sides of these equations are O(N⁻¹), we can replace these difference equations with differential ones as N → ∞, by taking the expectation over all input vectors ξ:

dQ_ij/dα = η ⟨δ_i x_j + δ_j x_i⟩ + η² ⟨δ_i δ_j⟩,  dR_in/dα = η ⟨δ_i y_n⟩.    (3)

In these equations, α := (number of steps)/N represents time (the normalized number of steps), and the brackets ⟨·⟩ represent the expectation when the input ξ follows N(ξ_i | 0, 1). Likewise, the difference equations of the second layers' order parameters D_ij and E_in are obtained from the second equation of (1) as

ΔD_ij = (η/N)[g(x_i) w_j · (t − s) + g(x_j) w_i · (t − s)] + O(N⁻²),  ΔE_in = (η/N) g(x_i) v_n · (t − s).

Again, the right-hand sides are O(N⁻¹); these difference equations can therefore also be rewritten as differential ones with N → ∞, by taking the expectation over all input vectors ξ:

dD_ij/dα = η ⟨g(x_i) w_j · (t − s) + g(x_j) w_i · (t − s)⟩,  dE_in/dα = η ⟨g(x_i) v_n · (t − s)⟩.    (6)

These differential equations govern the macroscopic dynamics. Note that the generalization loss ε_g, the expectation of the loss value ε(ξ) = ½‖s − t‖² over all input vectors ξ, is represented by the order parameters as

ε_g = ½ [ Σ_{i,j} D_ij ⟨g(x_i) g(x_j)⟩ − 2 Σ_{i,n} E_in ⟨g(x_i) g(y_n)⟩ + Σ_{n,m} F_nm ⟨g(y_n) g(y_m)⟩ ].

The expectation terms I_2, I_3 and I_4 appearing in these equations can be determined analytically for some activation functions g, including the sigmoid-like g(x) = erf(x/√2) [4] and g(x) = ReLU(x) := max{0, x}, which is commonly used in deep learning [22,23].
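A microscopic simulation of the SGD update rule (1) can be sketched as follows; this is our reading of the updates (with tanh standing in for the odd sigmoid activation, and illustrative hyperparameters):

```python
import numpy as np

def g(x):  return np.tanh(x)            # odd sigmoid stand-in for erf(x/sqrt(2))
def dg(x): return 1.0 - np.tanh(x) ** 2

def sgd_step(J, w, B, v, xi, eta):
    """One online SGD step on eps = 0.5*||s - t||^2, learning rate eta/N.
    Updates J (K, N) and w (K, O) in place; returns the sample loss."""
    N = xi.shape[0]
    x, y = J @ xi, B @ xi                 # hidden pre-activations
    s, t = g(x) @ w, g(y) @ v             # student and teacher outputs (O,)
    err = t - s
    delta = dg(x) * (w @ err)             # delta_i = g'(x_i) * w_i.(t - s)
    J += (eta / N) * np.outer(delta, xi)  # first-layer update
    w += (eta / N) * np.outer(g(x), err)  # second-layer update
    return 0.5 * float(err @ err)
```

Run over many random inputs, the sample loss traces out the learning curve whose expectation the differential equations describe.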

Singular regions and plateaus
In this section we review the concept of singular regions as the cause of the plateau phenomenon [14].
In general, the input-output relationship (i.e. the mapping) of a neural network is determined by the parameter values (i.e. the connection weights) of the network. However, this correspondence is not one-to-one; in other words, multiple settings of parameter values may realize one specific model (i.e. input-output mapping). For example, consider two-layer networks that have two hidden units (i.e. K = 2) and one output unit (i.e. O = 1),

s = w_1 g(J_1 · ξ) + w_2 g(J_2 · ξ),

where the parameter is (J_1, J_2, w_1, w_2). Then, all of the parameter settings in the regions

R_1 := {(J_1, J_2, w_1, w_2) | J_1 = J_2 = J*, w_1 + w_2 = w*},
R_2 := {(J_1, J_2, w_1, w_2) | J_1 = J*, w_1 = w*, w_2 = 0, J_2 arbitrary}

correspond to the same model (i.e. input-output mapping):

s = w* g(J* · ξ).    (11)

These parameter regions R_i are called singular regions. The model (11) can be regarded as a K = 1 model, parameterized by J* and w*. Now suppose that

∂ε_g/∂w* = 0 and ∂ε_g/∂J* = 0;    (12)

note that this holds when the K = 1 model (11) gives a local minimum of the generalization loss ε_g in the K = 1 parameter space. Then, one can show that the gradient of the generalization loss is also zero throughout the singular regions R_1 and R_2 in the original K = 2 parameter space; that is, ∂ε_g/∂w_i = 0 and ∂ε_g/∂J_i = 0 if (J_1, J_2, w_1, w_2) ∈ R_1 ∪ R_2, provided that (12) holds. Moreover, the singular region R_1 has the following properties [24]:
• The region R_1 is partially stable. When the parameter value is in the stable part of R_1, it undergoes a long period of random walk, driven by fluctuations due to the random sampling of ξ. Once the parameter value reaches the unstable part of the region, it starts moving away from it.
• The region R_1 is a Milnor-like attractor [19] in the sense that its basin of attraction has positive measure. This means that a randomly chosen initial parameter value gets trapped in the region with nonzero probability.
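The parameter redundancy defining R_1 and R_2 is easy to verify numerically. The sketch below (ours; the function names are ours) checks that every parameter setting in R_1, and in R_2, realizes the same mapping as the downsized K = 1 model (11):

```python
import numpy as np

def g(x): return np.tanh(x)   # any nonlinear activation works for this check

def k2_model(J1, J2, w1, w2, xi):
    """K = 2, O = 1 two-layer model s = w1 g(J1.xi) + w2 g(J2.xi)."""
    return w1 * g(J1 @ xi) + w2 * g(J2 @ xi)

def k1_model(Jstar, wstar, xi):
    """Downsized K = 1 model s = w* g(J*.xi)."""
    return wstar * g(Jstar @ xi)
```

Sweeping the free coordinate of R_1 (the split of w* between w_1 and w_2) leaves the mapping unchanged on every input, which is exactly why the loss gradient cannot distinguish points within the region.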
From these points, the singular region is quite different from an isolated saddle point. When trapped in the singular region, the learning process inevitably slows down; that is why the singular region is considered to be the cause of plateaus. However, the picture of problematic plateaus described above may not hold when a network has multiple outputs (O > 1): the loss gradient does not necessarily vanish on the singular region, and the network might not be attracted to it [21]. Thus, we analyzed the learning dynamics of networks that have multiple output units and examined whether or not the networks were trapped in the singular region and plateaued during learning.

Numerical results
We discuss the dynamics of learning in a two-layer perceptron by numerically solving the differential equations (3) and (6) of the order parameters. For simplicity, we focus on the case with K = M = 2 units in the hidden layers and O = 2 units in the output layers. In the numerical experiments described below, we initialize the weights of the first layers of the student and teacher networks with i.i.d. zero-mean Gaussian entries of variance 1/N. This initialization makes the initial values of the first layers' order parameters

Q = T = I, R = 0    (13)

on average. When N is finite, each component of the matrices Q, R and T carries noise that vanishes as N → ∞. It is critical how we initialize the weights of the second layers of the student and teacher networks, w_i and v_n. For example, consider the case

w_1, w_2, v_1, v_2 ∝ c,    (14)

where c is an arbitrary two-dimensional constant vector whose norm is 1. In this setting, the outputs of the teacher and student networks, s, t ∈ R², are confined to the one-dimensional subspace along c. This makes the learning process equivalent to one with one output unit. In fact, when (14) holds, s = s̃c and t = t̃c, where we defined the scalars t̃ and s̃ as Σ_{n=1}^{M} ṽ_n g(B_n · ξ) and Σ_{i=1}^{K} w̃_i g(J_i · ξ), respectively, with w_i = w̃_i c and v_n = ṽ_n c. And from the update rule (1), each w_i remains proportional to c, with the scalar coefficients following w̃_i ← w̃_i + (η/N) g(x_i)(t̃ − s̃); this is simply the update rule when both networks have only one output unit. Thus, how the student network with two-dimensional output learns the teacher network with two-dimensional output largely depends on the initial condition of the weight matrices of their second layers. To see this, we consider an initial condition parameterized by θ:

w_1 = v_1 = c, w_2 = v_2 = c_θ,    (17)

where c is again an arbitrary two-dimensional constant vector whose norm is 1, and c_θ is the constant vector obtained by rotating c by θ. We refer to the parameter θ as the non-degeneracy, because it regulates the degeneracy of the weight matrices of the second layers of both networks. We can test various situations by changing θ continuously; θ = 0 makes both matrices degenerate, and θ = π/2 makes both matrices orthogonal.
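The θ-parameterized initialization can be sketched as follows (our helper functions; our reading of condition (17) is that the student and teacher share the second-layer rows c and c_θ):

```python
import numpy as np

def rotate(c, theta):
    """Rotate a 2D vector c by the angle theta."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ c

def second_layers(theta, c=(1.0, 0.0)):
    """Initial second layers under condition (17): w1 = v1 = c, w2 = v2 = c_theta.
    Returns student rows w_i and (identical) teacher rows v_n."""
    c = np.asarray(c, dtype=float)
    W = np.stack([c, rotate(c, theta)])
    return W, W.copy()
```

The corresponding second-layer order parameters are then D = E = F with diagonal 1 and off-diagonal cos θ, so θ = 0 collapses the output to the span of c while θ = π/2 makes the rows orthogonal.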
The former situation, θ = 0, is equivalent to the one-dimensional-output situation, as described above. The initial condition of the second layers' order parameters D_ij, E_in and F_nm, corresponding to (17), is given by

D = E = F = [[1, cos θ], [cos θ, 1]].    (18)

Putting these initial conditions together, we examined the learning dynamics in two ways: by simulating the original microscopic system with finite N, i.e. conducting stochastic gradient descent in accordance with the update rule (1), and by numerically solving the differential equations (3) and (6) of the order parameters under initial conditions that match the initial weights used when simulating the microscopic system. The black lines in figure 2 show the time courses of the generalization loss ε_g in several typical situations: (a) θ = 0, (b) θ = π/8, (c) θ = π/4 and (d) θ = π/2. To evaluate quantitatively how near the student network is to the singular regions R_1 and R_2, we calculated two measures: the overlap of the vectors J_1 and J_2, m^(1)_12 := J_1 · J_2 / (‖J_1‖ ‖J_2‖), and the minimum norm of the vectors w_1 and w_2, l^(2)_min := min{‖w_1‖, ‖w_2‖}. The overlap m^(1)_12 measures the proximity to the region R_1, and l^(2)_min indicates the distance to the region R_2. Figure 2 also shows the time evolutions of these measures with blue and red lines, respectively; results of the microscopic simulations are shown by dots. In every plot in figure 2, the solid lines and dots agree well, meaning that the macroscopic system of order parameters appropriately captures the microscopic system of connection weights. In figure 2(a), the generalization loss ε_g stops decreasing during the first ∼1800 steps, along with high values of m^(1)_12, meaning that this plateau derives from falling into the singular region R_1 caused by the network's symmetry. In figure 2(b), the plateau shortens; an increase in m^(1)_12 is still observed, although its peak is lower than in figure 2(a). In figures 2(c) and (d), there is no apparent plateau.
In particular, the overlap m^(1)_12 remains 0 throughout figure 2(d), signifying that the student network does not approach the singular region R_1 at all during learning. Note also that no approach to the singular region R_2 is observed in any plot in figure 2 at any time, as indicated by the persistently high values of l^(2)_min. Figure 3 shows the times at which the plateau begins and ends, depending on θ, calculated from the numerical solutions for the order parameters. Here we define a plateau as 'a period during which the logarithm of the generalization loss decreases at a rate slower than half of the typical rate', where the typical rate is measured as the rate at the time when ε_g < 10⁻¹⁰ is achieved. We found that a plateau is observed if and only if, roughly, |θ| < 0.1π holds. Note that our quantitative definition of a plateau contains some arbitrariness, but this does not affect the main point: the plateau phenomenon is alleviated as |θ| increases, as is evident from figure 2.
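A plateau criterion of this kind can be implemented mechanically. The helper below is our illustration; for self-containedness it substitutes the median decrease rate for the paper's 'typical rate' (which is measured once ε_g < 10⁻¹⁰ is achieved):

```python
import numpy as np

def plateau_mask(eps_g, alpha, frac=0.5):
    """Flag times where log(eps_g) decreases slower than frac * typical rate.
    Simplification: 'typical rate' is taken as the median rate over the run."""
    rate = -np.gradient(np.log(eps_g), alpha)   # decrease rate of log-loss
    return rate < frac * np.median(rate)
```

On a synthetic loss curve with a flat stretch inserted between two fast-decay stretches, the mask picks out exactly the flat stretch.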

Cases for orthogonal second layers
Up to this point, we considered cases in which the second layers of the two networks have identical weights, as in (17). However, this is not practical, because it means that the student knows about the teacher in advance. Thus, we consider a slightly different situation, wherein the student matrix and the teacher matrix are not identical but are both orthogonal:

(w_1, w_2) an orthonormal pair, (v_1, v_2) = (Φ w_1, Φ w_2),    (19)

where Φ denotes the rotation matrix by an angle φ. The initial conditions of the order parameters D_ij, E_in and F_nm corresponding to (19) are given by

D = F = I, E = [[cos φ, sin φ], [−sin φ, cos φ]].    (20)

Under these initial conditions, we found that the orthogonality w_1 ⊥ w_2 remains true at any time during learning, as formally stated in the proposition below. In other words, we prove that the student network stays away from the singular region R_1.
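The mechanism behind this can be probed directly. The sketch below is our construction: the hypothetical helper `mc_dD12` Monte-Carlo-estimates ⟨g(x_2) w_1·(t − s) + g(x_1) w_2·(t − s)⟩, i.e. the expected drift of D_12 = w_1 · w_2 up to the factor η, with an odd activation (tanh as a stand-in). The drift vanishes under the orthogonal initialization, while it is clearly nonzero for the degenerate (θ = 0) initialization.

```python
import numpy as np

def mc_dD12(J, w, B, v, n_samples=100000, seed=0):
    """MC estimate of <g(x2) w1.(t - s) + g(x1) w2.(t - s)>, the expected drift
    of D12 = w1.w2 (up to the learning-rate factor), with odd activation tanh."""
    rng = np.random.default_rng(seed)
    Xi = rng.normal(size=(n_samples, J.shape[1]))
    gx, gy = np.tanh(Xi @ J.T), np.tanh(Xi @ B.T)   # g(x_i), g(y_n) per sample
    err = gy @ v - gx @ w                            # t - s, shape (S, O)
    return float(np.mean(gx[:, 1] * (err @ w[0]) + gx[:, 0] * (err @ w[1])))
```

The contrast between the two initializations is the content of the proposition: orthogonality is a fixed property of the macroscopic dynamics, not an accident of the initial step.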

Proposition. Assume K = M = 2 and O ≥ 2, and consider the situation in which N → ∞.
Assume that the activation g is an odd function and that the right-hand sides of the differential equations are Lipschitz continuous in the vicinity of the solution trajectory (Q, R, D, E) during learning. If both the teacher and the student network have a pair of orthogonal column vectors in their second-layer weight matrices, that is, v_1 ⊥ v_2 and w_1 ⊥ w_2 hold at the initial state, then the orthogonality w_1 ⊥ w_2 holds at any time during learning.
We give the proof of the proposition in the appendix. The numerical solutions of the differential equations of the order parameters, under the initial conditions described above, are shown in figure 4. These solutions tell us that there is no approach to the singular regions R_1 and R_2 and that no plateau is seen, except for the case of φ = π/4 (figure 4(d)); this plateau is not due to the singular regions but rather to being stuck at a saddle point, from which the second-layer weight vectors of the student network, perturbed by the stochastic gradient, can easily escape.
[Figures 2 and 4: solid lines show ε_g, m^(1)_12 and l^(2)_min, the latter two indicating how close the student network is to the singular regions R_1 and R_2, respectively; panels of figure 2 correspond to (a) θ = 0, (b) θ = π/8, (c) θ = π/4, (d) θ = π/2. Simulation results of the microscopic systems are shown by dots (diamonds, circles and triangles for ε_g, m^(1)_12 and l^(2)_min, respectively); the generalization loss of the simulations is approximated by the sample loss ε averaged over 100 contiguous steps. Simulation parameters: N = 1000, η = 0.1 (η/N = 0.0001); activation g(x) = erf(x/√2).]

Simulation results for more practical cases
Using order parameters, we theoretically analyzed the case with two hidden units. However, networks with a greater number of hidden units are usually used in practice. Also, the setting above assumes an infinite number of samples, which is not available in reality. Furthermore, various optimization techniques developed in recent years, such as mini-batching, dropout and gradient descent with adaptive learning rates, are widely used. In these practical cases the learning dynamics might not be analytically tractable, so we examined them by numerical simulation of the microscopic systems. In our experiment, we used a student network and a teacher network that both have 100 input units and ten hidden units. First, we generated 4000 training samples and 1000 test samples using a teacher network whose weights were chosen randomly. We then trained the student network, whose weights were randomly initialized, using the training samples only, and computed the training loss and the test loss after every epoch.
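The experiment can be re-created in outline as follows. This is our simplified sketch: the layer sizes and sample count follow the text, while the optimizer hyperparameters, the tanh activation and all names are illustrative choices of ours.

```python
import numpy as np

def g(x):  return np.tanh(x)
def dg(x): return 1.0 - np.tanh(x) ** 2

def train(N=100, K=10, O=10, n_train=4000, epochs=30, lr=0.1, batch=50, seed=0):
    """Mini-batch SGD on data generated by a randomly drawn teacher network.
    Returns the training-loss history (one value per epoch)."""
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(K, N)) / np.sqrt(N); v = rng.normal(size=(K, O)) / np.sqrt(K)
    J = rng.normal(size=(K, N)) / np.sqrt(N); w = rng.normal(size=(K, O)) / np.sqrt(K)
    X = rng.normal(size=(n_train, N))
    T = g(X @ B.T) @ v                      # teacher outputs for all samples
    history = []
    for _ in range(epochs):
        for b in rng.permutation(n_train).reshape(-1, batch):
            xb, tb = X[b], T[b]
            h = xb @ J.T                    # pre-activations (batch, K)
            err = g(h) @ w - tb             # s - t per sample
            gw = g(h).T @ err / batch       # gradient w.r.t. second layer
            gJ = (dg(h) * (err @ w.T)).T @ xb / batch   # gradient w.r.t. first layer
            w -= lr * gw
            J -= lr * gJ
        history.append(0.5 * float(np.mean(np.sum((g(X @ J.T) @ w - T) ** 2, axis=1))))
    return history
```

Running the same loop with O = 1 versus O = 10 is the comparison reported in figure 5.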
Figure 5 shows the dynamics of learning in this experiment. (Although the dynamics depend on the initial weights, the qualitative shapes of the plateaus do not; see the supplemental material.) The output dimension is 1 in figures 5(a) and (c), and 10 in figures 5(b) and (d). These results clearly indicate that the plateau is alleviated by multiple output units.
We also examined the cases with an Adam optimizer, bias terms and dropout regularization. All these results are consistent with the idea that multiple outputs mitigate the plateaus due to the singular regions (see the supplemental material).

Conclusion
We analyzed the learning dynamics of two-layer networks that have multiple output units by means of a statistical mechanical formulation. By defining order parameters, deriving the differential equations they follow, and solving these equations, we clarified experimentally and theoretically that multiple-output networks are less likely than single-output networks to be trapped in plateaus caused by singularity.
With this paper, we suggest reconsidering the established idea that singular structures cause plateaus that interrupt learning. However, more general cases, such as deeper neural networks, have yet to be investigated.

Acknowledgment
This work was supported by JSPS KAKENHI Grant-in-Aid for Scientific Research (A) (No. 18H04106).

Appendix. Proof of continued orthogonality during learning
This section gives the proof of the proposition in the main text. Proposition. Assume K = M = 2 and O ≥ 2, and consider the situation in which N → ∞. Assume that the activation g is an odd function and that the right-hand sides of the differential equations are Lipschitz continuous in the vicinity of the solution trajectory (Q, R, D, E) during learning. If both the teacher and the student network have a pair of orthogonal column vectors in their second-layer weight matrices, that is, v_1 ⊥ v_2 and w_1 ⊥ w_2 hold at the initial state, then the orthogonality w_1 ⊥ w_2 holds at any time during learning.
Proof. We prove the following claim: suppose K = M = 2 and O ≥ 2, and suppose the activation g is an odd function; then the differential equations (3) and (6), rewritten in the transformed order parameters Q̂ := Q, R̂ := RΦ⁻¹, D̂ := D and Ê := EΦ⁻¹ (with Φ := E(0), the rotation matrix appearing in the initial condition (20)), preserve the condition

(*) Q̂ ∝ I, R̂ ∝ I, D̂ ∝ I and Ê ∝ I,    (A.1)

where I denotes the 2 × 2 identity matrix, and X ∝ I means that there exists c ∈ R such that X = cI.
Proving this claim (A.1) is sufficient for the proof of the proposition, because D̂ = D ∝ I implies D_12 = w_1 · w_2 = 0. Suppose that Θ̂(α) = (Q̂(α), R̂(α), D̂(α), Ê(α)) is one solution of the differential equations. What we have to show is that the solution always lies in the subspace

S := {Θ̂ = (Q̂, R̂, D̂, Ê) | Q̂ ∝ I, R̂ ∝ I, D̂ ∝ I, Ê ∝ I}.

If we denote by PΘ̂ the vector from Θ̂ to the foot of its perpendicular to S (note that this mapping is linear), what we have to show is

f(α) := PΘ̂(α) = 0

for all time α. By substituting the initial conditions (13) and (20) we have f(0) = 0, and we have df/dα = 0 whenever f(α) = 0 at a given time α. These imply that f(α) ≡ 0 is one solution of the differential equation for f. Lipschitz continuity ensures the uniqueness of the solution of the differential equation; therefore we have f(α) ≡ 0 for all α.
To prove the claim (A.1), we first show the following lemma.
Lemma A.1. If the condition (*) in the claim (A.1) holds, the following relations hold: the expectations appearing on the right-hand sides of (3) and (6) vanish whenever their factors carry the two hidden-unit indices asymmetrically; in particular, ⟨g(a_{i_1}) g(b_{j_1})⟩ = 0, ⟨g'(a_{i_1}) b_{j_1}⟩ = 0 and ⟨g'(a_{i_1}) b_{j_1} g(c_{i_2}) g(d_{j_2})⟩ = 0, where each of a, b, c and d is either x or y, and {i_e, j_e} = {1, 2} for e = 1, 2, 3, 4.