Statistical Mechanical Analysis of Catastrophic Forgetting in Continual Learning with Teacher and Student Networks

When a computational system continuously learns from an ever-changing environment, it rapidly forgets its past experiences. This phenomenon is called catastrophic forgetting. While many methods have been proposed to avoid catastrophic forgetting, most of them are based on intuitive insights into the phenomenon, and their performances have been evaluated by numerical experiments using benchmark datasets. Therefore, in this study, we provide a theoretical framework for analyzing catastrophic forgetting by using teacher-student learning. Teacher-student learning is a framework in which we introduce two neural networks: one neural network is a target function in supervised learning, and the other is a learning neural network. To analyze continual learning in the teacher-student framework, we introduce the similarity of the input distributions and the input-output relationships of the target functions as the similarity of tasks. In this theoretical framework, we also provide a qualitative understanding of how a single-layer linear learning neural network forgets tasks. Based on the analysis, we find that the network can avoid catastrophic forgetting when the similarity among input distributions is small and that of the input-output relationships of the target functions is large. The analysis also suggests that a system often exhibits a characteristic phenomenon called overshoot, which means that even if the learning network has once undergone catastrophic forgetting, it may perform reasonably well after further learning of the current task.


Introduction
Intelligent systems must continue to adapt to a changing environment to behave suitably in the real world. An agent needs to remember previously learned experiences while adapting to new knowledge. Continual learning is a framework to train an agent under a sequence of different tasks to achieve such intelligent behavior.1,2) One of the main challenges in continual learning for neural networks is called catastrophic forgetting.5) A variety of methods have been developed to avoid catastrophic forgetting in training a neural network. The proposed methods can be broadly classified1,2) into regularization approaches,6,7) dynamic architectural updates,8,9) and memory replay.10,11) Most of the previous methods for avoiding catastrophic forgetting are based on intuitive insights into the phenomenon, and their performances have been evaluated by numerical experiments using benchmark datasets. Therefore, there is room for consideration in the theoretical evaluation of these proposed methods. If we have a theoretical framework for understanding the phenomenon of catastrophic forgetting, then we can evaluate existing methods within the same framework. Additionally, by understanding catastrophic forgetting, we may be able to propose theoretically suitable methods within the framework. For these reasons, the purpose of this study is to provide a theoretical framework for the phenomenon of catastrophic forgetting.
In this study, we provide a theoretical framework to analyze catastrophic forgetting by using teacher-student learning.12) Teacher-student learning is a framework in which we introduce a neural network called the teacher network as the target function in supervised learning, while the learning neural network is called the student network. The student network learns from the difference between the teacher's output and its own output.13-15) For simplicity, we analyze a single-layer linear neural network as the model, which learns two tasks. We analyze one of the most fundamental learning rules, stochastic gradient descent (SGD).
To analyze continual learning in the teacher-student framework, we need to model the similarity between tasks. In this study, we introduce the following two similarity measures: the similarity between input distributions (input space similarity) and the similarity between the input-output relationships of teacher networks (weight space similarity).16,17) Additionally, since a task in supervised learning is formulated as learning an input-output relationship under a specific input distribution, it is reasonable to introduce these two similarities. Specifically, we assume that each input distribution is supported only on a certain subspace of the input space and use the size of the overlap of these subspaces as the input space similarity. We define the weight space similarity as the inner product of the teacher weights representing the true input-output relationship of each task.
Several researchers15,17) have theoretically analyzed continual learning. Biehl et al.15) provided a teacher-student learning framework using statistical mechanics for analyzing learning dynamics when the task changes continuously. Our study differs from theirs in that they studied how well student networks adapt to changing tasks, while we focus on the catastrophic forgetting of previous experiences. Bennani et al.17) provided a framework for studying continual learning in the neural tangent kernel (NTK)18) regime. While the framework of Bennani et al. provides upper bounds on the generalization error for a wide range of models and inputs, our framework provides an analytical solution for the generalization error for a specific input and model.
We theoretically analyze the abovementioned setup for continual learning. Based on the analysis, we found that the network can avoid catastrophic forgetting when the input space similarity is small and the weight space similarity is large. In other words, the student network can remember the first task relatively well when the similarity of the input distributions is small and the similarity of the input-output relationships is large. In addition, a characteristic phenomenon called overshoot was observed as a behavior of catastrophic forgetting. Overshoot is the phenomenon in which, once an intelligent system starts learning the current task, it largely forgets the previous task, while continuing to learn the current task lets the student network recover some of the previously forgotten task. This phenomenon suggests that even if a system undergoes catastrophic forgetting, it may still exhibit better performance if it learns the current task for a longer time.

Model
In this section, we formulate the continual learning problem with a teacher-student framework,12) in which a student network aims to learn an input-output relationship realized by a teacher network. In particular, we focus on the case where the student network learns two tasks in a sequential fashion. We then analyze the qualitative performance of the student network by introducing two types of similarity measures between tasks: input space similarity and weight space similarity.

Teacher-student framework
First, we introduce the teacher-student framework. We consider a regression problem, in which the learning task for the student network is to mimic the true function (teacher network) under a certain input distribution. In this study, therefore, we introduce two teacher networks as target functions and two input distributions.
We consider that a single-layer linear neural network learns two tasks sequentially, as illustrated in Fig. 1. The student network receives N-dimensional input data x_v and outputs s_v = J^T x_v, where J is the student weight. The subscript of each variable (e.g., x_v, v ∈ {1, 2}) indicates the task index. The learning task for the student network is to mimic the output of a teacher network, t_v = B_v^T x_v, under a certain input distribution x_v ∼ P(x_v). We refer to task 1 as the pair of teacher weight and input distribution (B_1, P(x_1)) and task 2 as the pair (B_2, P(x_2)). For simplicity, we assume that the input data are sampled from the normal distribution, x_v ∼ N(0, σ_v² I_v), where I_v characterizes the data distribution of P(x_v). We will define I_v in detail in Section 2.2.
In this study, we focus on the on-line learning setting. In the on-line framework, every time input data x_v are given, we update the student weight J; input data x_v will never be used again for learning. Since the input used for learning is discarded and successive inputs are statistically independent, the student weight J is independent of new input data x_v. We update the student weight J to minimize the error between its output and the output of the teacher network. We use the squared error function ε_v = (1/2)(t_v − s_v)², which is the most common choice for regression. To optimize the student's weights, we use stochastic gradient descent (SGD).
The update amount of the weights with SGD while learning task 1 is written as

ΔJ = (η/N)(t_1 − s_1) x_1,

and the update amount of the weights with SGD while learning task 2 is written as

ΔJ = (η/N)(t_2 − s_2) x_2,

in which we set the learning rate to η/N so that our macroscopic system is N-independent. For simplicity of calculation, we assume that the elements of the initial student weight J^0 and of the teacher weights B_1 and B_2 are sampled independently from normal distributions: J^0 ∼ N(0, σ_J²), B_1 ∼ N(0, σ_B1²), and B_2 ∼ N(0, σ_B2²). To evaluate model performance, we focus on a metric called generalization error for each task. The generalization errors for task 1 (ε_g1) and task 2 (ε_g2) are defined as follows:

ε_g1 = ⟨(1/2)(t_1 − s_1)²⟩,   (5)
ε_g2 = ⟨(1/2)(t_2 − s_2)²⟩.   (6)
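To make this setup concrete, the following minimal Python sketch (NumPy assumed; all variable names and parameter values are our own illustrative choices, not those of the paper) runs the on-line SGD dynamics for a single task and estimates the generalization error by Monte Carlo sampling.

import numpy as np

rng = np.random.default_rng(0)
N, eta, sigma2 = 1000, 0.5, 1.0      # input dimension, learning rate, input variance

B = rng.normal(0.0, 1.0, N)          # teacher weight, B ~ N(0, sigma_B^2 = 1)
J = rng.normal(0.0, 1.0, N)          # initial student weight, J ~ N(0, sigma_J^2 = 1)

def gen_error(J, B, n_test=2000):
    # Monte Carlo estimate of eps_g = <(t - s)^2 / 2> under x ~ N(0, sigma2 I)
    x = rng.normal(0.0, np.sqrt(sigma2), (n_test, N))
    return 0.5 * np.mean((x @ B - x @ J) ** 2)

for step in range(20 * N):
    x = rng.normal(0.0, np.sqrt(sigma2), N)  # fresh input, used once and discarded
    t, s = B @ x, J @ x                      # teacher and student outputs
    J += (eta / N) * (t - s) * x             # SGD step on eps = (t - s)^2 / 2

print(gen_error(J, B))                       # approaches 0 for this single task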

Input space similarity
In this study, as discussed in Section 2.1, we introduced two input distributions, P(x_1) and P(x_2). In this section, we model these input distributions to evaluate the relationship between the inputs of the two tasks. We characterize the difference between P(x_1) and P(x_2) by the difference in the input space, that is, by I_1 and I_2, which were defined in Section 2.1. Here, we introduce a parameter r, the proportion of the subspace in which each input has nonzero values relative to the total input space, to define I_1 and I_2 in a simple way. As we will see in more detail later, setting r determines the proportion of the common space of x_1 and x_2. For simplicity of calculation, we set I_1 = I_{1:rN} and I_2 = I_{(1−r)N:N}, where I_{i:j} indicates a matrix in which the diagonal components from the i-th to the j-th dimensions are ones and the rest are zeros. Fig. 2 is a schematic diagram of the input space of task 1 and task 2; gray nodes have nonzero values. As an intuitive example, consider handwritten digits: the colored pixels in the dataset of the number "1" and those of a similar-looking digit such as "7" largely overlap because of the similarity in their shapes. In other words, the ratio r of the common space is close to 1. Moreover, the colored pixels in
the dataset of the number "1" and those of the number "8" are not similar, so the number of colored pixels in common is estimated to be small. In such a situation, the ratio r of the common space is close to 0.5. Based on these observations, the proposed modeling of the data distribution is reasonable and reflects a real-world situation.
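A short sketch of this construction (again NumPy; the helper names are ours) builds the diagonals of I_1 = I_{1:rN} and I_2 = I_{(1−r)N:N} as 0/1 masks and samples inputs supported on the corresponding subspaces.

import numpy as np

def input_masks(N, r):
    # I_1 = I_{1:rN}: first rN diagonal entries are one.
    # I_2 = I_{(1-r)N:N}: last rN diagonal entries are one.
    # The two supports overlap in (2r - 1)N dimensions when r > 0.5.
    i1 = np.zeros(N); i1[: int(r * N)] = 1.0
    i2 = np.zeros(N); i2[int((1 - r) * N):] = 1.0
    return i1, i2

def sample_input(rng, mask, sigma2):
    # x ~ N(0, sigma2 * I_v): nonzero only on the subspace selected by the mask
    return mask * rng.normal(0.0, np.sqrt(sigma2), mask.shape)

rng = np.random.default_rng(0)
i1, i2 = input_masks(N=10, r=0.8)
print(i1, i2, i1 @ i2)   # overlap size: (2r - 1)N = 6 common dimensions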

Weight space similarity
As shown in Section 2.1, a task is characterized by an input distribution and a teacher network. Therefore, the similarity between tasks is determined by the input space similarity and the similarity of the teachers. Since the input space similarity was defined in Section 2.2, we define the similarity of the teachers in this section.
In this study, learning task 1 is equivalent to the student weight J approaching the teacher weight B_1 under the data distribution P(x_1), and learning task 2 is equivalent to the student weight J approaching the teacher weight B_2 under the data distribution P(x_2). Therefore, we define the similarity of the teachers by using B_1 and B_2 as

q = B_1^T B_2 / (‖B_1‖ ‖B_2‖),

and we call the parameter q the weight space similarity. When there is no correlation at all between tasks 1 and 2, q = 0, and when tasks 1 and 2 are completely similar, q = 1.
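Under our reading of q as the normalized inner product above, one way to generate a teacher pair with a prescribed weight space similarity is the following sketch (the construction B_2 = qB_1 + sqrt(1 − q²)B_⊥ is our illustrative choice, not necessarily the paper's):

import numpy as np

def teacher_pair(rng, N, q, sigma_B=1.0):
    # Draw B_1 ~ N(0, sigma_B^2) and mix it with an independent orthogonal
    # direction so that the normalized inner product of B_1 and B_2 equals q.
    b1 = rng.normal(0.0, sigma_B, N)
    b_perp = rng.normal(0.0, sigma_B, N)
    b_perp -= (b_perp @ b1) / (b1 @ b1) * b1            # orthogonalize against b1
    b_perp *= sigma_B * np.sqrt(N) / np.linalg.norm(b_perp)
    b2 = q * b1 + np.sqrt(1.0 - q ** 2) * b_perp
    return b1, b2

rng = np.random.default_rng(0)
b1, b2 = teacher_pair(rng, N=2000, q=0.7)
print(b1 @ b2 / (np.linalg.norm(b1) * np.linalg.norm(b2)))   # ~ 0.7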

Theory
In this section, we show the analytical solution for the generalization error under the continual learning setting. The generalization errors shown in (5) and (6) can be rewritten as follows:

ε_gv = ⟨(1/2)(t_v − s_v)²⟩ = (σ_v²/2) (B_v − J)^T I_v (B_v − J),

where the angle bracket ⟨·⟩ expresses the expectation over the input distribution and we used x_v ∼ N(0, σ_v² I_v). Here, we introduce order parameters to capture the state of the system macroscopically while learning task 1 and while learning task 2. As shown in Fig. 3, the theory we have derived in (53), (66), and (90) is valid.
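As a quick numerical sanity check of the rewriting above (NumPy assumed; the parameter values are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
N, r, sigma2 = 1000, 0.8, 0.8
mask = np.zeros(N); mask[: int(r * N)] = 1.0     # diagonal of I_1

B = rng.normal(0.0, 1.0, N)
J = rng.normal(0.0, 1.0, N)

# Closed form: eps_g1 = (sigma2 / 2) * (B - J)^T I_1 (B - J)
closed = 0.5 * sigma2 * np.sum(mask * (B - J) ** 2)

# Monte Carlo estimate of <(t - s)^2 / 2> over x_1 ~ N(0, sigma2 I_1)
x = mask * rng.normal(0.0, np.sqrt(sigma2), (5000, N))
mc = 0.5 * np.mean((x @ B - x @ J) ** 2)

print(closed, mc)   # the two values agree up to sampling noise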
As can be seen from Figs. 3(a) and 3(b), two different behaviors are observed depending on the parameter settings. In Fig. 3(a), the generalization error approaches its limit from below. In Fig. 3(b), the generalization error greatly exceeds the limit of ε_g1 once and then converges. We refer to the phenomenon in Fig. 3(b) as overshoot. In Section 4.2, we will discuss overshoot in detail. We also refer to the limit of the generalization error of task 1 after learning task 2 as the forgetting value. This is because ε_g1 approached 0 immediately after the completion of task 1 learning but rose to this value after the completion of task 2 learning, indicating that task 1 was forgotten by learning task 2. We will discuss the forgetting value in Section 4.1.

Forgetting value
In this section, we discuss the forgetting value. The forgetting value is the limit of the generalization error of task 1 while learning task 2, so we can obtain it as the t → ∞ limit of ε_g1 in (90); the result is given in (91). To determine the contributions of the input space similarity r and the weight space similarity q to the forgetting value, we generated a heat map of forgetting values in Fig. 4. We numerically calculated the forgetting value for various r and q. Fig. 4 visualizes these results with a heat map for N = 1500, σ_B1² = σ_B2² = σ_J² = 1. The darker color represents smaller forgetting values.
This heat map shows that when the input space similarity r is small and the weight space similarity q is large, the forgetting value is small. This means that the student network remembers task 1 better if q is large and r is small. In other words, the student network can remember task 1 relatively well when the two teacher networks are similar and the two inputs are well separated. We also found that the input space similarity r and the weight space similarity q affect the forgetting value multiplicatively and act on it in opposite directions.
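The qualitative trend of Fig. 4 can be reproduced with a direct simulation along the following lines (a sketch with our own parameter choices and grid, not the paper's exact procedure):

import numpy as np

def forgetting_value(rng, N, r, q, eta=0.5, sigma2=1.0, steps=10):
    # Train on task 1, then on task 2, and return the final eps_g1 (forgetting value).
    i1 = np.zeros(N); i1[: int(r * N)] = 1.0
    i2 = np.zeros(N); i2[int((1 - r) * N):] = 1.0
    b1 = rng.normal(0.0, 1.0, N)
    bp = rng.normal(0.0, 1.0, N)
    bp -= (bp @ b1) / (b1 @ b1) * b1                   # orthogonal direction
    bp *= np.sqrt(N) / np.linalg.norm(bp)
    b2 = q * b1 + np.sqrt(1.0 - q ** 2) * bp           # weight space similarity q
    J = rng.normal(0.0, 1.0, N)
    for b, mask in ((b1, i1), (b2, i2)):
        for _ in range(steps * N):
            x = mask * rng.normal(0.0, np.sqrt(sigma2), N)
            J += (eta / N) * ((b - J) @ x) * x         # on-line SGD step
    return 0.5 * sigma2 * np.sum(i1 * (b1 - J) ** 2)   # eps_g1 after task 2

rng = np.random.default_rng(0)
for r in (0.6, 0.8, 1.0):
    for q in (0.1, 0.5, 0.9):
        print(r, q, forgetting_value(rng, N=500, r=r, q=q))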
From (90), we also know that the student network remembers task 1 well if σ_1² is small, that is, when similar inputs (inputs with small variance) are given for task 1. Moreover, the input for task 2 is not related to the forgetting value itself but contributes significantly to the transient behavior of catastrophic forgetting. This means that the forgetting value depends on the characteristics of the input for task 1, while the input for task 2 is one of the factors governing the dynamics of catastrophic forgetting.

Overshoot
As mentioned in Section 1, it is important to understand the behavior of catastrophic forgetting. In this section, we discuss overshoot as a characteristic behavior of catastrophic forgetting. Overshoot is a phenomenon in which the generalization error of task 1 while learning task 2 greatly exceeds the forgetting value once and then converges. Therefore, overshoot can be interpreted as a phenomenon in which the model first forgets task 1 by learning task 2 and then learns task 1 again through further learning of task 2.
Here, we discuss the condition under which overshoot occurs qualitatively. First, we consider obtaining the condition for overshoot analytically. The first step is to find the time at which the time derivative of the generalization error ε_g1 shown in (90) becomes zero. However, as can be seen by differentiating (90), products and sums of exp(−(rησ_2²)² t), exp(−2(rησ_2²) t), and exp(−(rησ_2²) t) are mixed in the resulting expression, making it difficult to find an analytical solution. Therefore, we derived the conditions under which overshoot occurs by comparing the coefficients of each term.
The generalization error of task 1 while learning task 2 shown in (90) consists of a sum of terms, Σ_i C_i exp(−α_i t). The smaller α_i is, the slower the decay and the longer the term contributes to the generalization error. Therefore, by considering the sign of the coefficient of the term with the smallest α_i, we can determine whether overshoot occurs. Overshoot occurs when the coefficient of the term with the smallest α_i is positive, because ε_g1 then converges to the forgetting value from a value greater than it. Conversely, overshoot may not occur when the coefficient of the term with the smallest α_i is negative, because ε_g1 then converges to its destination from a value smaller than the forgetting value. The smallest α_i is determined by the setting of η, r, and σ_2², and the specific conditions on these parameters for the occurrence of overshoot follow from comparing the exponents. If ηrσ_2² exceeds two, then training diverges. Note that when ηrσ_2² = 1, the dominating coefficient C_i corresponds to C_2 − 2C_1, and its sign depends on the other hyperparameters.
Therefore, under the condition ηrσ_2² = 1, overshoot occurs only when C_2 − 2C_1 > 0 and does not occur when C_2 − 2C_1 < 0. Additionally, note that when 0 < ηrσ_2² < 1, overshoot "may not" occur: even if the coefficient C_i of the term with the smallest α_i is negative, there can be two extrema in the time development of the generalization error for task 1, and thus overshoot can still occur.
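The sign argument can be illustrated with a toy sum of decaying exponentials; the coefficients and rates below are arbitrary illustrative values, not the actual C_i and α_i of (90).

import numpy as np

t = np.linspace(0.0, 20.0, 400)

# eps(t) = eps_inf + sum_i C_i exp(-alpha_i t); the slowest term dominates late times.
def curve(C, alpha, eps_inf=1.0):
    return eps_inf + sum(c * np.exp(-a * t) for c, a in zip(C, alpha))

over = curve(C=(+0.8, -1.5), alpha=(0.2, 1.0))  # slowest coefficient positive: overshoot
mono = curve(C=(-0.8, -0.2), alpha=(0.2, 1.0))  # slowest coefficient negative: approach from below

print(over.max() > 1.0)   # True: the curve exceeds its limit (the forgetting value) before converging
print(mono.max() > 1.0)   # False for this choice of coefficients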

Effects of hyperparameters on catastrophic forgetting
For practical applications, we consider whether it is possible to modify the update rule of stochastic gradient descent (SGD) to avoid catastrophic forgetting. In the setting of our study, we know that learning converges in the range 0 < rησ_2² < 2, so we limit our discussion to this range. In the case of SGD, the only parameter that can be adjusted is η, and adjusting η does not change the forgetting value shown in (91). However, the speed of forgetting shown in (92) depends linearly on ηrσ_2². This means that the speed of forgetting can be reduced by making η smaller. On the other hand, there are proposals that improve the learning algorithms themselves, such as elastic weight consolidation (EWC) proposed by Kirkpatrick et al.,7) so the framework proposed in this paper may be used to analyze these algorithms in the future.
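A sketch of this effect (reusing the same simulation style; for simplicity we start task 2 from a student that has fully learned task 1, J = B_1, and all parameter values are ours) tracks ε_g1 during task 2 for two learning rates:

import numpy as np

def eps_g1_during_task2(rng, N, r, q, eta, sigma2=1.0, steps=10, record_every=500):
    i1 = np.zeros(N); i1[: int(r * N)] = 1.0
    i2 = np.zeros(N); i2[int((1 - r) * N):] = 1.0
    b1 = rng.normal(0.0, 1.0, N)
    b2 = q * b1 + np.sqrt(1.0 - q ** 2) * rng.normal(0.0, 1.0, N)  # similarity ~ q
    J = b1.copy()                     # student that has already learned task 1
    trace = []
    for step in range(steps * N):
        x = i2 * rng.normal(0.0, np.sqrt(sigma2), N)
        J += (eta / N) * ((b2 - J) @ x) * x
        if step % record_every == 0:
            trace.append(0.5 * sigma2 * np.sum(i1 * (b1 - J) ** 2))
    return trace

rng = np.random.default_rng(0)
slow = eps_g1_during_task2(rng, N=500, r=0.8, q=0.5, eta=0.1)
fast = eps_g1_during_task2(rng, N=500, r=0.8, q=0.5, eta=0.5)
# Both traces climb toward the same forgetting value; the eta = 0.5 trace gets there faster.
print(slow[:5]); print(fast[:5])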

Overparametrization
In cutting-edge theoretical research, there is interest in situations where the number of student parameters is greater than the number of teacher parameters. We call this situation "overparametrization." In this section, we discuss such a setting. In this paper, both the student and the teacher networks are single-layer linear neural networks, so the numbers of their parameters coincide; considering overparametrization requires extending the setting to a more general one, such as the hidden manifold model.20) In the hidden manifold model, the input is assumed to be sampled from a low-dimensional manifold rather than a Gaussian distribution. The dimension of the manifold is called the intrinsic dimension. In the hidden manifold model, the generalization error at the convergence destination no longer depends on the learning rate or on the difference between the numbers of student and teacher parameters; instead, the ratio of the input dimension to the intrinsic dimension influences the generalization error. We will discuss the possible influences of this consequence on our two findings. One finding is that the forgetting value is characterized by the weight space similarity and the input space similarity. The other is the phenomenon called overshoot. The following is a discussion of how these findings may change when the inputs are extended to be non-Gaussian in a setting such as the hidden manifold model described above.
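A minimal sketch of such a non-Gaussian input generator, following the general form x = f(Fz) of the hidden manifold model (the names, the Gaussian embedding F, and the choice f = tanh are our illustrative assumptions):

import numpy as np

def hidden_manifold_input(rng, F, sigma2=1.0):
    # x = f(F z): an N-dimensional input generated from a D-dimensional latent z
    # (D < N), so the data lie on a D-dimensional manifold embedded in R^N.
    D = F.shape[1]
    z = rng.normal(0.0, np.sqrt(sigma2), D)
    return np.tanh(F @ z / np.sqrt(D))     # elementwise nonlinearity f = tanh

rng = np.random.default_rng(0)
N, D = 1000, 50                            # D is the intrinsic dimension
F = rng.normal(0.0, 1.0, (N, D))           # fixed random embedding
x = hidden_manifold_input(rng, F)
print(x.shape)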
First, let us consider the possible impact on the first finding. Input space similarity quantifies the degree to which the input data overlap in the input dimensions. As mentioned above, in the hidden manifold model, the intrinsic dimension of the manifold that generates the data is more important than the apparent input dimension. Therefore, in addition to the input space similarity, the similarity of the intrinsic dimensions of the manifolds may be important.
Next, we consider the effect on the second finding, overshoot. Goldt et al.20) reported that learning proceeds more quickly without being trapped by poor local solutions under the assumption of the hidden manifold model. Overshoot is a phenomenon in which the learning rate is so large that learning proceeds too fast, temporarily raising the generalization error of old tasks above its value at the convergence destination. If learning is also accelerated in continual learning by making the input distribution non-Gaussian, this is effectively equivalent to a larger learning rate, and overshoot may become more pronounced. In this case, the conditions under which overshoot occurs may be modified.

Conclusions
In this study, we provided a theoretical framework for analyzing catastrophic forgetting by using teacher-student learning. We theoretically analyzed the generalization error of the student network for continual learning of two tasks. We found that input space similarity and weight space similarity exert significant effects on catastrophic forgetting. The student network can remember the first task relatively well when the similarity of the input distributions is small and when the similarity of the input-output relationships is large. We also found that a characteristic phenomenon, which we refer to as overshoot, is observed under specific hyperparameter conditions. This phenomenon suggests that even if a system undergoes catastrophic forgetting, it may still exhibit better performance if it learns the current task for a longer period of time.

Fig. 1. Schematic diagram of the continual learning problem in a teacher-student setup. The student network learns task 1 first and then learns task 2. The upper figure shows the ideal behavior of the generalization error for tasks 1 and 2.
Fig. 2. Schematic diagram of the input space of task 1 and task 2. Gray nodes have nonzero values.

(a) The learning curve for N = 3000, r = 0.8, q = 0.3, σ_B1 = σ_B2 = σ_J = 1, σ_1² = σ_2² = 0.8. (b) The learning curve for N = 3000, r = 0.8, q = 0.9, σ_B1 = σ_B2 = 1, σ_J = 11.

Fig. 3. The learning curve. Blue lines indicate experimental values, and orange lines indicate theoretical values. In both figures, the experimental and theoretical values overlap very well. Two different behaviors can be observed depending on the parameter settings. We will discuss this phenomenon in Section 4.2.