Introduction

In the past few decades, the broad potential for applying neural networks in various fields has driven their rapid development [1]. The reaction–diffusion neural network (RDNN) [2, 3] has been studied to reduce the computational burden of neural networks while ensuring system performance. Neural networks are now widely applied in computer vision, including image classification [4,5,6], object detection [7,8,9,10], and semantic segmentation [11, 12]. These methods perform well in offline learning, but it remains challenging to train models to complete a series of tasks in an online manner. Deep learning systems must continuously learn in dynamic environments [13,14,15,16], with the goal of learning tasks in unknown environments in a task-incremental manner [17, 18].

Catastrophic forgetting [19] is a major challenge for artificial intelligence. As shown in Fig. 1, when a neural network learns a new task, the knowledge of previously learned tasks is overwritten, leading the network model to forget almost all the information from those tasks. To learn new tasks efficiently and cumulatively, models can be trained by storing some of the old data and replaying it jointly with the current data [20,21,22,23]. However, in the real world of unbounded data, retaining old data is difficult because of storage constraints or privacy issues. Another approach to avoiding catastrophic forgetting is to save samples of prior tasks in their original format [24, 25] or to use pseudo-samples produced by a generative model [26, 27], and then to feed these prior-task samples back to the model while it learns the current task. However, for complicated networks trained on complex datasets, which require large generative models such as GANs [28, 29] and autoencoders [30, 31], this is inefficient. A further approach is to estimate how important the parameters of the original model are and to add penalties on their future changes to the final training loss. Such methods reduce the forgetting of previously learned knowledge by regularizing important parameters [32, 33] or intermediate features [21, 34, 35]. However, at present, these regularization strategies only consider the impact of parameter changes on the loss [36, 37] or the model output [38]. According to Hsu et al. and van de Ven et al. [39, 40], such strategies preserve old-class information to some extent but also limit the model's learning of new knowledge, often yielding a situation in which neither new knowledge is learned well nor knowledge related to previous tasks is preserved.

Fig. 1 Catastrophic forgetting in machine learning

To enable neural networks to continuously adapt to new tasks while retaining previous knowledge, we propose a continuous learning method with Bayesian parameter updating and weight memory (CL-BPUWM). Our method addresses two aspects. First, the neural network parameters must remain able to change after the deep neural network model has been trained; otherwise, they will not accurately reflect the features of new images. We present a Bayesian parameter update approach that estimates the Fisher information matrix of the current parameters as a measure of their importance. Second, because even small changes in network parameters can cause catastrophic forgetting, we compute importance weights by evaluating how sensitive the model's prediction function is to changes in the network parameters. The main contributions of our work are as follows:

  • To better learn and update the neural network parameters, we propose a simple and effective parameter update method based on the Bayesian criterion, select the most likely category as the final prediction for each sample, and introduce the diagonal Fisher information matrix, which greatly reduces the amount of computation and improves the efficiency of parameter updating.

  • Unlike previous methods based on storing data or generating pseudo-samples, we focus on the forgetting of old tasks in continuous learning and propose converting the sensitivity of the model's prediction function to changes in the network parameters into importance weights, so that old task-specific knowledge can be better retained and constrained.

  • We validated our algorithm on the CIFAR-100, CIFAR-10, and Split MNIST datasets, significantly improving classification accuracy compared with existing continuous learning algorithms while also reducing the amount of parameter computation.

Related work

This section reviews the current research on Bayesian continuous learning methods, regularization-based continuous learning methods, and continuous learning methods based on parameter importance calculation that are related to our work.

Continuous learning method based on Bayesian estimation

To allow deep learning models to continuously learn information about new tasks as they arrive, the deep neural network must update the learned parameters in real time. Broderick et al. [41] used sequential Bayesian updating to allow topic models to learn from streaming data and to update the estimated posterior in a streaming fashion from user-specified approximate batch primitives. Similarly, Huang et al. [42] used a sequential Bayesian method to adapt the parameters of a context-dependent deep neural network hidden Markov model (CD-DNN-HMM) to specific users in speech recognition. By adding prior knowledge to the parameter update process, a maximum a posteriori estimate of the parameters of a linear input network can serve as the foundation of the adaptive framework. Rashwan et al. [43] used Bayesian moment matching in sum–product networks to gradually match the moments of the posterior distribution, combined with three transfer learning techniques: weight transfer, an L2 norm between new and previous parameters, and a dropout variant of the previous parameters. This method made the Gaussian distribution assumption for neural network parameters reasonable.

However, sequential Bayesian estimation is a step-by-step learning process that updates only a small number of parameters per iteration, especially in the initial phase, which can lead to slow learning. This approach typically requires considerable computation for large-scale complex models, because the full Bayesian posterior distribution must be recalculated every time the parameters are updated, which can be very time-consuming.

Inspired by the above work and by Kirkpatrick et al. [36], CL-BPUWM is also a continuous learning approach based on Bayesian estimation. To make neural networks learn new tasks more quickly and effectively, we propose using Bayesian estimation to update the parameters of neural networks during continuous learning. The posterior knowledge learned while training on the previous task serves as the prior for the new task data, and this prior is then used with Bayesian estimation to learn the posterior for the new task. Compared with the traditional Bayesian method, this improves the ability to process new task parameters.

Continuous learning method based on regularization

Regularization-based work avoids storing raw inputs and prioritizes privacy. It constrains the model parameters by introducing regularization terms into the loss function, thereby preventing overfitting and improving generalization. Researchers have proposed a variety of regularization schemes to mitigate catastrophic forgetting in continuous learning. De Lange et al. [44] proposed estimating the model parameter distribution with a focus on priors: knowledge learned from new data is used as a prior, and changes to parameters in subsequent tasks are penalized. In data-centric approaches, the basic building block is the distillation of knowledge from previous models (trained on previous tasks) into models trained on new data [17]. LwF uses knowledge distillation to perform implicit regularization [45, 46]. LwF constructs pseudo-training data for the previous task by passing the training data of the new task through the old network before learning the new task, and then prevents catastrophic forgetting by jointly optimizing over the pseudo-training data of the previous task and the real data of the new task.

Many empirical studies have shown that regularization-based approaches are beneficial in mitigating catastrophic forgetting [18, 19]. However, regularization-based CL methods construct a regularization term as an approximate loss function for previous tasks to limit parameter updates, which may prevent the network parameters from reflecting the current new task in a timely manner. Most of these methods are heuristic, with no thorough theoretical understanding of the factors associated with catastrophic forgetting [47].

To enable the neural network to learn new task data efficiently, we predict the likelihood of unknown events probabilistically from past statistics, so that the method fits well with incremental learning and improves the model's classification of new-task knowledge.

Continuous learning method based on parameter importance calculation

As a neural network learns a particular task, it usually adjusts its learned parameters to complete that task. However, when learning a new task, the change in parameters causes the neural network to forget the previously learned parameters almost completely, so the network model performs worse on past tasks. Many methods determine which parameters have an important impact on model performance by calculating the importance of each parameter and then continually adjust the model according to this information [36,37,38, 45, 47, 48]. For example, elastic weight consolidation (EWC) [36] measures the sensitivity of the parameters with respect to each task through the Fisher information matrix, derived from the KL divergence, and indicates which parameters must be retained most to avoid forgetting old tasks. Mazur et al. [48] proposed CW-TaLaR, an idea similar to EWC that uses the Cramér–Wold distance (instead of the KL divergence) to compute the penalty term directly. Synaptic intelligence (SI) [37] calculates the path integral of each parameter's trajectory during training and takes the absolute value of this integral as an indicator of parameter importance, which can be represented by the local contribution of each parameter to the overall change in the loss during training. Memory Aware Synapses (MAS) [38] approximates the importance of parameters from the gradient of the squared L2 norm of the output of the learned function.

In practice, however, the EWC method mentioned above requires calculating appropriate weights (importances) for all neural network parameters, and this mechanism is derived from more complex methods that are computationally expensive. The MAS framework, compared with EWC and SI, avoids the complications caused by local minima of the loss function. However, it ignores the fact that the network parameters are only proxies for the prediction function of the neural network.

The CL-BPUWM method reduces catastrophic forgetting of past task information while allowing the deep model to continuously change its parameters to better perform the present task. Inspired by MAS [38], and considering that small changes in network parameters may change the final output for old tasks, this paper designs a sensitivity function that calculates the impact of network parameter changes on the model output and treats parameters whose changes cause large output changes as very important for the current task. During subsequent training, to alleviate the forgetting of previous task knowledge, these important parameters are added to the final loss as regularization terms to discourage them from changing.

Methods

To enable neural networks to learn continuously, CL-BPUWM, like EWC [36] and MAS [38], limits the difference between the new task parameters and the previously learned parameters as much as possible when learning the current task, avoiding catastrophic forgetting of previous tasks. Unlike EWC, we do not store a Fisher information matrix for each task individually but instead obtain the final Fisher information matrix by efficiently updating the Fisher matrix for the next task with a weighted average. Unlike MAS, we do not approximate importance directly from the gradient of the model output with respect to the model parameters but also account for how the parameters change at each step.

The CL-BPUWM method consists of two parts. The first part updates the parameters based on the Bayesian criterion and converts them into a Fisher information matrix to determine the parameter importance. The second part calculates, after training the current task, the degree of influence of changes in each parameter \(\theta_{i}\) in the model on the output of the model, yielding the importance \(M_{i}\) (importance weight) for the current task; the parameters with greater influence are retained and carried forward to the subsequent tasks [38].

Parameter update strategy based on the Bayesian criterion

To calculate parameter importance and update parameters, we use the basic framework of the EWC method [36]. When the parameters of the new task k + 1 are trained, EWC uses a regularization method to constrain the important parameters of the previous task k and prevent the important parameters of task k from changing, thereby avoiding the catastrophic forgetting of the important parameters of the previous task k during the training of the new task k + 1.

Given the optimal parameter \(\theta_{k}^{*}\) for the data \(D_{1:k}\) of the previous k tasks, the new data \(D_{k + 1}\) are used to learn the optimal parameter \(\theta_{k + 1}^{*}\) for the new task k + 1. Considering the training of the neural network from a probabilistic perspective, optimizing the parameters is equivalent to finding, without access to the past data \({D}_{1:k}\), the most likely value \(\theta = \theta_{k + 1}^{*}\) over tasks 1 to k + 1, such that the posterior given all data presented to the model thus far is maximized, \(\theta_{k + 1}^{*} \leftarrow \arg \mathop {\max }\limits_{\theta } \log p\left( {\theta \left| {D_{1:k + 1} } \right.} \right)\). Using Bayes' rule, the right-hand side can be expanded as in Eq. (1).

$$ \begin{aligned} \log p\left( {\theta \left| {D_{1:k + 1} } \right.} \right)& = \log p\left( {\left. {D_{k + 1} } \right|\theta } \right) + \log p\left( {\theta |D_{1:k} } \right)\\ & \quad - \log p\left( {D_{k + 1} |D_{1:k} } \right), \end{aligned} $$
(1)

where \(\log p\left( {\left. {D_{k + 1} } \right|\theta } \right)\) is the log-likelihood of the new task k + 1, i.e., the log-probability of the new data \(D_{k + 1}\) given the parameters, so that \(\log p\left( {\left. {D_{k + 1} } \right|\theta } \right) = - L_{k + 1} (\theta )\); \(\log p(\theta |D_{1:k} )\) is the posterior over tasks 1 to k; and \(\log p(D_{k + 1} |D_{1:k} )\) is a constant unrelated to the parameter \(\theta\) being optimized. Therefore, only the first two terms need to be optimized.

Substituting Eq. (1) into the objective gives

$$ \arg \mathop {\max }\limits_{\theta } \log p\left( {\theta \left| {D_{1:k + 1} } \right.} \right) = \arg \mathop {\max }\limits_{\theta } \left\{ { - L_{k + 1} (\theta ) + \log p(\theta |D_{1:k} )} \right\} + C, $$
(2)

where C is a constant, and the posterior \(\log p(\theta |D_{1:k} )\) is difficult to obtain without access to the past data \(D_{1:k}\). Using the diagonal Laplace approximation to estimate a Gaussian posterior, and taking a Taylor expansion around the optimal parameters \(\theta_{k}^{*}\) of the first k tasks (where the first-order term vanishes because \(\theta_{k}^{*}\) is a mode of the posterior) while ignoring third- and higher-order terms, we obtain:

$$ \begin{aligned} \log p\left( {\theta |D_{1:k} } \right)& \approx \log p\left( {\theta_{k}^{*} |D_{1:k} } \right) \\ & \quad - \frac{1}{2}\frac{{\partial^{2} \log p\left( {\theta_{k}^{*} |D_{1:k} } \right)}}{{\partial \theta_{k}^{*2} }}\left( {\theta - \theta_{k}^{*} } \right)^{2} , \end{aligned} $$
(3)

where \(\theta_{k}^{*}\) is the optimal parameter given the previous task data and \(\frac{{\partial^{2} \log p(\theta_{k}^{*} |D_{1:k} )}}{{\partial \theta_{k}^{*2} }}\) is the Hessian matrix \({\rm H}\left( {\log p(\theta |D_{1:k} )} \right)\) of \(\log p(\theta_{k}^{*} |D_{1:k} )\). Since the Fisher information matrix is equal to the negative expectation of the Hessian matrix and assuming that it is a diagonal matrix, denoted by F, then:

$$ \begin{aligned} F & = - E\left[ {\rm H} \right] \\ & = E_{{p(\theta |D_{1:k} )}} \left[ {\left( {\left. {\frac{{\partial \log p(\theta |D_{1:k} )}}{\partial \theta }} \right|_{{\theta = \theta_{k}^{*} }} } \right)^{2} } \right] \\ & = - E_{{p(\theta |D_{1:k} )}} \left[ {\frac{{\partial^{2} \log p(\theta_{k}^{*} |D_{1:k} )}}{{\partial \theta_{k}^{*2} }}} \right]. \end{aligned} $$
(4)

Substituting Eqs. (3) and (4) into Eq. (2) gives:

$$ \arg \mathop {\max }\limits_{\theta } \log p\left( {\theta \left| {D_{1:k + 1} } \right.} \right) \approx \arg \mathop {\max }\limits_{\theta } \left\{ { - L_{k + 1} (\theta ) - \frac{\lambda }{2}\sum\limits_{i} {F_{i}^{k} } \left( {\theta - \theta_{k}^{*} } \right)^{2} } \right\} + C^{\prime}, $$
(5)

where \(F_{i}^{k}\) is the diagonal Fisher information for parameter \(\theta_{i}\) computed on the old tasks. Assuming that F is diagonal, the number of stored values is reduced to the diagonal terms \(F_{i}^{k}\) evaluated at the optimal parameters \(\theta_{k}^{*}\) of the previous tasks. Each diagonal element is built from the squared first-order derivative of the log-likelihood with respect to the corresponding network parameter (equivalently, the negative expected second-order derivative of the log-probability), and it reflects how important that parameter is to the previous task data; this is key to understanding its role in preventing forgetting. \(C'\) is a constant, \(C' = C + \log p(\theta_{k}^{*} |D_{1:k} )\), and \(\lambda\) is a hyperparameter that weights the importance of the previous task data.

Extracting the negative sign from the right-hand term of Eq. (5) and updating the optimal parameters of the current task k + 1 yields:

$$ \theta_{k + 1}^{*} \leftarrow \arg \mathop {\min }\limits_{\theta } \left\{ {L_{k + 1} (\theta ) + \frac{\lambda }{2}\sum\limits_{i} {F_{i}^{k} } \left( {\theta - \theta_{k}^{*} } \right)^{2} } \right\}. $$
(6)

Equation (6) relies only on the data \(D_{k + 1}\) of the current task to be trained. Previous task data and information are encapsulated in the quadratic penalty of the second term, which keeps the model parameters \(\theta\) from deviating substantially from the previously learned optimal parameters \(\theta_{k}^{*}\).

However, EWC requires storing the Fisher information for each task separately and regularizing jointly over all tasks so that the parameters do not deviate too far from the optimized parameters of each task. When scaling continuous learning to many tasks, the number of regularization terms therefore grows linearly with the number of tasks. To estimate the empirical Fisher information matrices, the EWC method also requires additional computation over each task dataset; if there are many tasks and the network has millions of parameters, this becomes computationally infeasible. To this end, we introduce an improvement similar to that presented in a previous publication [49]. We first calculate the Fisher information matrix for the current task, treat it as an importance score, and record it. When the Fisher information matrix of the next task's parameters is calculated, a running average is used to obtain the final Fisher information matrix efficiently:

$$ F_{\theta }^{{t_{n} }} = \gamma F_{\theta }^{{t_{n} }} + \eta F_{\theta }^{{t_{n - 1} }} , $$
(7)

where \(F_{\theta }^{{t_{n} }}\) is the Fisher information matrix obtained with the current task parameters, \(t_{n}\) denotes the training iteration, \(\gamma ,\eta \in \left( {0,1} \right)\) are hyperparameters, and \(\gamma + \eta = 1\). In the experiments below, \(\gamma = 0.4\) and \(\eta = 0.6\). The Fisher information matrix computed in this way contains information about previous tasks and eliminates additional forward–backward passes over old datasets. At the end of each task, the Fisher information matrix \(F_{\theta }^{{t_{n} }}\) of the last iteration of the current task is stored, replacing the Fisher information matrix \(F_{\theta }^{{t_{n - 1} }}\) from the previous iteration \(t_{n - 1}\), and is used to regularize the next task. During the whole training process, we only need to store two sets of Fisher information matrices, independent of the number of tasks.
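A minimal PyTorch sketch of this step is shown below, assuming a classification model and a data loader for the current task: the diagonal empirical Fisher is estimated from squared gradients of the log-likelihood of the predicted class, and the stored matrix is blended with the previous one using the weights γ = 0.4 and η = 0.6 of Eq. (7). Function and variable names are illustrative, not taken from the authors' implementation.

```python
import torch
import torch.nn.functional as F


def estimate_diagonal_fisher(model, data_loader, device="cpu"):
    """Estimate the diagonal empirical Fisher information on the current task.

    Gradients of the log-likelihood of the most likely class are squared and
    accumulated, approximating Eq. (4) under the diagonal assumption. For
    simplicity, gradients are squared at the batch level, which approximates
    the per-sample empirical Fisher.
    """
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    n_samples = 0
    for x, _ in data_loader:
        x = x.to(device)
        model.zero_grad()
        log_probs = F.log_softmax(model(x), dim=1)
        # Use the most likely class as the label (empirical Fisher).
        log_probs.max(dim=1).values.sum().backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        n_samples += x.size(0)
    return {n: f / n_samples for n, f in fisher.items()}


def update_fisher(fisher_new, fisher_prev, gamma=0.4, eta=0.6):
    """Running-average update of Eq. (7): F_t <- gamma * F_t + eta * F_{t-1}."""
    if fisher_prev is None:  # first task: nothing to blend with
        return fisher_new
    return {n: gamma * fisher_new[n] + eta * fisher_prev[n] for n in fisher_new}
```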

Calculation of parameter importance

We first compute the Fisher information matrix for the previous task and, when training the next task, update it for the current task using Eq. (7). However, since the Fisher information matrix captures only local properties of the model around its minimum, we also need to compute an importance weight for each parameter that indicates its importance to the previous tasks. A sequence of tasks is denoted \(D = \{ D_{1} , \ldots ,D_{k - 1} \}\), where \(1, \ldots ,k - 1\) indexes the tasks and \(D_{1} , \ldots ,D_{k - 1}\) are the corresponding training sets. Each training task has its training data \((x_{i} ,\hat{y}_{i} )\), consisting of an input feature vector \(x_{i} \in X\) and a target vector \(\hat{y}_{i} \in \hat{Y}\). We start with Task 1 and train the model to minimize the task loss \(L_{1}\) on Task 1's training data \((x_{1} ,\hat{y}_{1} )\). When the model converges, the learned function \(f: x \to y\), fitted on the known data samples, approximates the true function F of the model; it maps a new input \(x_{1}\) to an output \(y_{1}\), and this approximating function is then used to learn further tasks. Our goal is to estimate an importance weight for each parameter in the network, where parameter importance is defined not as an (inverse) measure of parameter uncertainty as in [36], nor as the sensitivity of the loss to parameter changes as in [37], but as the sensitivity of the learned function \(f\) to parameter changes [38].

Given a data point \(x_{k}\), the output of the network model is \(f\left( {x_{k} ;\theta } \right)\). If a small change \(\delta (t)\) in the parameters \(\theta (t)\) at time t causes a large change in the output of the learned function, then the corresponding parameter is defined as important, and the change in the function output can be approximated as:

$$ f\left( {x_{k} ;\theta \left( {t + \Delta t} \right)} \right) - f\left( {x_{k} ;\theta \left( t \right)} \right) \approx \sum\limits_{i} {g_{i} \left( {x_{k} } \right)\delta_{i} \left( t \right)} , $$
(8)

where \(\delta_{i}\) is a small perturbation of the parameter \(\theta_{i}\) that causes a change in the function output, and \(g_{i} \left( {x_{k} } \right) = \frac{{\partial f\left( {x_{k} ;\theta } \right)}}{{\partial \theta_{i} }}\) is the partial derivative of the learned function with respect to the weight parameter \(\theta_{i}\), indicating to what extent a small perturbation of this parameter changes the function output for the input data point \(x_{k}\). Starting from Task 1, for each observed data point \(x_{k}\) in the task, to preserve the prediction ability of the model and prevent changes in the parameters that are important for that ability, it is necessary to calculate the influence that each data point has on the learned function:

$$ m_{i} = \frac{1}{N}\sum\limits_{k = 1}^{N} {\left\| {g_{i} (x_{k} )} \right\|} , $$
(9)

where N is the total number of input data points and \(g_{i} (x_{k} )\) is the gradient of the learned function with respect to the weight parameter \(\theta_{i}\).
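The sketch below illustrates how the average sensitivity \(m_{i}\) of Eq. (9) could be accumulated in PyTorch. Because the output \(f(x_{k};\theta)\) is vector valued, the squared L2 norm of the output is used as the scalar whose gradient serves as \(g_{i}(x_{k})\), following the MAS-style formulation [38]; this choice, and all names, are illustrative assumptions.

```python
import torch


def accumulate_output_sensitivity(model, data_loader, device="cpu"):
    """Accumulate the per-parameter sensitivity m_i of Eq. (9).

    The squared L2 norm of the model output is used as the scalar function
    whose gradient g_i(x_k) measures how much a small change in parameter
    theta_i changes the learned function; gradients are taken per batch.
    """
    importance = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    n_points = 0
    model.eval()
    for x, _ in data_loader:
        x = x.to(device)
        model.zero_grad()
        output = model(x)
        # Scalar surrogate for the output change: squared L2 norm of f(x_k; theta).
        (output.norm(2, dim=1) ** 2).sum().backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                importance[n] += p.grad.detach().abs()
        n_points += x.size(0)
    # Average over the N data points, as in Eq. (9).
    return {n: m / n_points for n, m in importance.items()}
```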

Unlike [38], we do not approximate importance directly from the gradient of the model output with respect to the model parameters but instead consider the changes that occur in the parameters at each step. Similar to [50], at each step we choose a descent direction such that the amount of change it causes in the model is reflected by the KL divergence, since a parameter change induces a corresponding change in the model distribution. The KL divergence measures the difference between probability distributions before and after a parameter change, so we use it to define a similarity measure between nearby density functions. Assuming a small perturbation \(\delta_{i} \to 0\) of the parameters \(\theta\), the KL divergence can be approximated by a second-order Taylor expansion:

$$ \begin{aligned} D_{KL} \left( {p_{\theta } \,\| \,p_{\theta + \Delta \theta } } \right) & \approx - E\left[ {\nabla \log p_{\theta } } \right]\Delta \theta - \frac{1}{2}\Delta \theta^{T} E\left[ {\nabla^{2} \log p_{\theta } } \right]\Delta \theta \\ & = \frac{1}{2}\Delta \theta^{T} E\left[ { - \nabla^{2} \log p_{\theta } } \right]\Delta \theta \\ & = \frac{1}{2}\Delta \theta^{T} F_{\theta } \Delta \theta \\ & \approx \frac{1}{2}\sum\limits_{i = 1}^{p} {F_{{\theta_{i} }} \Delta \theta_{i}^{2} } , \end{aligned} $$
(10)

where the first-order term vanishes because the expected score is zero, the last line uses the diagonal approximation, and \(F_{\theta }\) is the empirical Fisher information matrix at \(\theta\) defined in Eq. (4). Measured by the KL divergence, this quantity captures how much the model distribution changes per unit change in the parameters.

In this case, parameter importance is defined as the ratio of the change in the learning function to the distance between the conditional likelihood distributions in the parameter space. The importance of the parameter \(\theta_{i}\) can be calculated as:

$$ M_{t} (\theta_{i} ) = \sum\limits_{{t = t_{1} }}^{{t_{n} }} {\frac{{m_{i} }}{{\frac{1}{2}F_{{\theta_{i} }}^{t} \times \Delta \theta_{i} (t)^{2} + \varepsilon }}} , $$
(11)

where \(m_{i}\) is the average importance obtained by dividing the accumulated importance measure by the total number of data points seen during training, as given in Eq. (9); \(F_{{\theta_{i} }}^{t}\) is the Fisher information when the parameter is updated at iteration \(t\), calculated by Eq. (7); \(\Delta \theta_{i} \left( t \right) = \theta_{i} \left( {t + \Delta t} \right) - \theta_{i} \left( t \right)\); and \(\varepsilon > 0\) prevents division by zero when the first part of the denominator vanishes. In the experiments, we set \(\varepsilon\) to 0.1. As soon as a new data point enters the network, the importance of the weight parameters for that data point can be computed online with this equation.
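A sketch of the online accumulation in Eq. (11) is given below: after each parameter update step, the per-parameter sensitivity is divided by the KL-based distance term \(\tfrac{1}{2}F_{\theta_{i}}^{t}\Delta\theta_{i}(t)^{2} + \varepsilon\) (with ε = 0.1 as in our experiments) and summed over iterations. The function name and the exact point at which the running Fisher of Eq. (7) is read are assumptions for illustration.

```python
import torch


def accumulate_parameter_importance(importance, m, fisher, theta_prev, theta_curr, eps=0.1):
    """One online step of Eq. (11).

    importance : dict accumulating M_t(theta_i) over training iterations
    m          : per-parameter sensitivity m_i from Eq. (9)
    fisher     : running diagonal Fisher F_{theta_i}^t from Eq. (7)
    theta_prev : parameter values before this update step
    theta_curr : parameter values after this update step
    eps        : small constant (0.1 in our experiments) preventing division by zero
    """
    for n in importance:
        delta = theta_curr[n] - theta_prev[n]            # Delta theta_i(t)
        denom = 0.5 * fisher[n] * delta ** 2 + eps       # KL-based distance term
        importance[n] += m[n] / denom
    return importance
```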

Objective function

Algorithm 1 A continuous learning algorithm for Bayesian parameter update and weight memory

When learning a new task, to avoid catastrophic forgetting of the previous task, the objective function has a regularization loss term in addition to the loss \(L^{k}\) of the current task to penalize changes in parameters that are important for the previous task. This regularization term is based on the combination of the importance from the Fisher information matrix and the sensitivity of the learning function:

$$ \tilde{L}^{k} \left( \theta \right) = L^{k} \left( \theta \right) + \lambda \sum\limits_{i} {\left( {F_{{\theta_{i}^{k - 1} }} + M_{t} \left( {\theta_{i} } \right)} \right)\left( {\theta_{i} - \theta_{i}^{k - 1} } \right)^{2} } , $$
(12)

where \(F_{{\theta_{i}^{k - 1} }}\) is the Fisher information of parameter \(\theta_{i}\) at the last iteration of task k – 1, and \(M_{t} (\theta_{i} )\) is the importance accumulated from the first training iteration \(t_{1}\) to the last training iteration \(t_{n}\) of task k – 1. The hyperparameter \(\lambda \in [0,1]\) scales \(F_{{\theta_{i}^{k - 1} }}\) and \(M_{t} (\theta_{i} )\) to the same order of magnitude so that both retain their influence. \({\theta }_{i}\) is the current weight being trained, and \(\theta_{i}^{k - 1}\) is the value of the corresponding weight at the end of task k – 1. Finally, CL-BPUWM is combined with the brain-inspired replay [51] algorithm, as shown in Algorithm 1.
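The final objective of Eq. (12) can be sketched in PyTorch as follows: each squared deviation from the old-task parameters is weighted by the sum of the stored Fisher value and the accumulated importance. The current-task loss is assumed to be an ordinary (e.g., cross-entropy) loss tensor, and all names are illustrative.

```python
import torch


def regularized_loss(task_loss, model, theta_old, fisher_old, importance_old, lam):
    """Total loss of Eq. (12): task loss plus the weighted quadratic penalty.

    theta_old      : parameters saved at the end of task k-1
    fisher_old     : Fisher values from the last iteration of task k-1, Eq. (7)
    importance_old : accumulated importance M_t(theta_i) from task k-1, Eq. (11)
    lam            : hyperparameter balancing the two importance measures
    """
    penalty = 0.0
    for n, p in model.named_parameters():
        if n in theta_old:
            weight = fisher_old[n] + importance_old[n]
            penalty = penalty + (weight * (p - theta_old[n]) ** 2).sum()
    return task_loss + lam * penalty
```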

Experiment

To verify our algorithm, it was implemented using an 11th Gen Intel® Core™ i7-11700K processor @ 3.60 GHz × 16, an NVIDIA GeForce RTX 2080 Ti graphics processor, 32 GB of memory, and Python 3.5.2 and PyTorch 1.1.0 on Ubuntu 18.04. All experiments were performed on the same computer.

Baseline method

We compared the following baseline methods: joint training [52], learning without forgetting [45], synaptic intelligence [37], elastic weight consolidation [36], generative replay [13], and brain-inspired replay [51].

Joint training [52] (Joint): This training method stores all the data that have been learned before and retrains the model on all known data, which is effective but has an excessively high training cost.

Learning without forgetting [45] (LwF): Given the new task data, the output probabilities of the old network on these data are recorded and preserved as targets while learning the new task; only the new task data are then used to train the network until all parameters converge.

Synaptic Intelligence [37] (SI): During the training of a new task, the path integral of the loss with respect to parameter changes is calculated online to estimate the importance weight of each parameter. Changes to parameters that are important for previous tasks are then penalized during the training of subsequent tasks.

Elastic Weight Consolidation [36] (EWC): This method uses the diagonal of the Fisher information matrix as the importance measure of the network parameters when learning a new task. EWC uses a separate penalty for each prior task, protecting the important parameters within the parameter space.

Generative replay [13] (GR): This method trains the deep neural network sequentially without referring to past data. It uses a dual-model architecture that preserves previously acquired knowledge by replaying generated pseudo-data and pairing the generated data with the solver model's responses for past tasks to represent those tasks. When a new task is presented, these generated data are interleaved with the new data to update both the generator and the solver networks.

Brain-inspired replay [51] (BI-R): This method is a new, brain-inspired variant of GR. Without storing data, internal or hidden representations generated by the network's feedback connections are replayed.

BI-R + SI: BI-R combined with SI; the base network is the brain-inspired replay variant network, and experiments are performed in class-incremental scenarios.

In summary, CL-BPUWM was first compared with the above regularization methods; then CL-BPUWM was combined with brain-inspired replay (denoted BI-R + Our), with BI-R + SI used as a baseline for experimental comparison.

Experimental setup

Dataset introduction

CIFAR-100 dataset [53] The CIFAR-100 dataset has 100 classes, each with 600 color images of 32 × 32 pixels. The dataset contains 50,000 training images (500 per class) and 10,000 test images (100 per class). The original 32 × 32-pixel RGB color images were normalized (i.e., the channel mean was subtracted from each pixel value and the result was divided by the channel standard deviation, with the mean and standard deviation computed over all training images), but no other preprocessing or augmentation was performed. We divided the complete CIFAR-100 dataset into 5 tasks, 10 tasks, and 20 tasks, each task being a disjoint subset of the classes in the entire dataset.

CIFAR-10 dataset The CIFAR-10 natural image dataset contains images similar to, but not overlapping with, CIFAR-100. It consists of 60,000 32 × 32-pixel RGB images in 10 categories of 6,000 images each. We used the standard training/test split with 50,000 training images and 10,000 test images and divided the CIFAR-10 dataset into 5 tasks, each with 2 classes.

Split MNIST dataset [54] For the Split MNIST dataset, the digits 0–9 of the original MNIST dataset are divided into five tasks, each containing two digits (binary classification). As shown in Fig. 2, we divide the MNIST dataset in order: task 1 contains the digits 0 and 1, task 2 contains the digits 2 and 3, and so on. The original 28 × 28-pixel grayscale images were used without preprocessing. Again using the standard train/test split, there are 60,000 training images and 10,000 test images in total (see Table 1).

Fig. 2 Task division of the Split MNIST dataset: the complete MNIST dataset is divided into five tasks, each containing two non-overlapping digits

Table 1 A basic overview of the dataset
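The class-incremental task splits described above can be reproduced with a few lines of torchvision/NumPy code; the sketch below splits CIFAR-100 into disjoint groups of classes (5, 10, or 20 tasks). The normalization statistics are the commonly used CIFAR-100 values, and the use of Subset and the variable names are illustrative assumptions rather than the authors' exact pipeline.

```python
import numpy as np
from torch.utils.data import Subset
from torchvision import datasets, transforms

# Normalize with per-channel statistics computed over the training images
# (commonly used CIFAR-100 values).
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762)),
])
train_set = datasets.CIFAR100(root="./data", train=True, download=True, transform=transform)


def split_into_tasks(dataset, n_tasks):
    """Split a labeled dataset into n_tasks disjoint class subsets."""
    targets = np.array(dataset.targets)
    classes = np.unique(targets)
    class_groups = np.array_split(classes, n_tasks)
    tasks = []
    for group in class_groups:
        idx = np.where(np.isin(targets, group))[0]
        tasks.append(Subset(dataset, idx))
    return tasks


tasks_5 = split_into_tasks(train_set, 5)    # 20 classes per task
tasks_10 = split_into_tasks(train_set, 10)  # 10 classes per task
tasks_20 = split_into_tasks(train_set, 20)  # 5 classes per task
```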

Experimental details

Network architecture For a fair comparison, the same base neural network structure was used for all compared methods, similar to a previous study [51]. The base network for CIFAR-100 consists of five pretrained convolutional layers, two fully connected layers with 2000 ReLU units each, and a softmax output. The first four convolutional layers each use a normalization layer followed by a ReLU nonlinearity. A bottleneck attention module (BAM) is added after the normalization of the third and fourth convolutional layers. To prevent overfitting during training, we also added dropout to the classification layers. The base network for Split MNIST is a fully connected network with two hidden layers and a softmax output layer, with dropout applied at the final classification layer (see Table 2).

Table 2 The basic network structure used by the three data sets
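A simplified sketch of the CIFAR-100 base network is given below: five convolutional layers (the first four with normalization), a lightweight channel-attention block standing in for BAM after the third and fourth layers, two 2000-unit fully connected layers with dropout, and a linear output layer to which softmax is applied in the loss. The channel sizes and the simplified attention block are assumptions for illustration; the actual BAM module combines channel and spatial attention.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Lightweight channel-attention block used as a simplified stand-in for BAM."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x).view(x.size(0), -1, 1, 1)
        return x * w


class BaseNet(nn.Module):
    """Simplified base network: five conv layers, attention after conv 3 and 4,
    two 2000-unit fully connected layers with dropout, and a linear output."""

    def __init__(self, num_classes=100, dropout=0.5):
        super().__init__()
        chans = [3, 16, 32, 64, 128, 256]  # illustrative channel sizes
        blocks = []
        for i in range(5):
            blocks.append(nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1))
            if i < 4:                       # first four conv layers use normalization
                blocks.append(nn.BatchNorm2d(chans[i + 1]))
            blocks.append(nn.ReLU(inplace=True))
            if i in (2, 3):                 # attention after the 3rd and 4th conv layers
                blocks.append(ChannelAttention(chans[i + 1]))
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256, 2000), nn.ReLU(inplace=True), nn.Dropout(dropout),
            nn.Linear(2000, 2000), nn.ReLU(inplace=True), nn.Dropout(dropout),
            nn.Linear(2000, num_classes),   # softmax applied in the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```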

Parameters Our network was optimized with a batch size of 256 in the CIFAR-100 and CIFAR-10 experiments and 128 in the Split MNIST experiments, using the adaptive optimizer Adam [55] (\(\beta_{1} = 0.9,\;\beta_{2} = 0.999\)) with an initial learning rate of \(1 \times 10^{ - 4}\). Training used the cosine annealing schedule [56] for 5000 iterations (CIFAR-100) or 2000 iterations (Split MNIST) per task. In this benchmark, the optimizer's state is reset after training each task (see Table 3).

Table 3 Parameter setting of the model
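The optimization setup described above can be approximated as follows; the plain cosine annealing schedule over the per-task iteration budget is an assumption about how the schedule of [56] was applied, and the model is the base network sketched earlier.

```python
import torch

# Per-task iteration budget from the text:
# 5000 iterations for CIFAR-100 / CIFAR-10, 2000 for Split MNIST.
model = BaseNet(num_classes=100)  # base network sketched in the previous section
iterations_per_task = 5000

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=iterations_per_task)

# The optimizer (and scheduler) state is reset after training each task.
```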

Evaluation metrics

We report two standard measures of class-incremental learning: accuracy and average forgetting. Accuracy [22] is defined as the average accuracy over all learned classes and is an important index of model performance. We run the benchmark 10 times with different class orderings and report the average. Average forgetting [57] quantifies how much a previous task is forgotten while continuing to learn. After training task k, the forgetting measure of task i is defined as \(f_{i}^{k} = \max \left\{ {\left( {a_{1,i} - a_{k,i} } \right),\left( {a_{2,i} - a_{k,i} } \right), \ldots ,\left( {a_{k - 1,i} - a_{k,i} } \right)} \right\},\;\forall i < k\), where \(a_{k,i}\) denotes the test accuracy on task i after training task k. Normalizing by the number of previously seen tasks, the average forgetting over all k tasks is \(F_{k} = \frac{1}{k - 1}\sum\nolimits_{i = 1}^{k - 1} {f_{i}^{k} }\), where a lower \(F_{k}\) means less forgetting of previous tasks.
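Both metrics can be computed from a matrix of per-task test accuracies; a small sketch is shown below, assuming acc[k][i] stores the accuracy on task i after training task k (0-indexed here, unlike the 1-indexed notation above).

```python
import numpy as np


def average_accuracy(acc, k):
    """Average accuracy over all tasks learned so far, after training task k (0-indexed)."""
    return float(np.mean(acc[k, : k + 1]))


def average_forgetting(acc, k):
    """Average forgetting after training task k (0-indexed): for each earlier task i,
    the largest drop from any accuracy recorded after task i was learned to the
    current accuracy, averaged over the k previously seen tasks."""
    drops = [np.max(acc[i:k, i] - acc[k, i]) for i in range(k)]
    return float(np.mean(drops))


# Example: accuracies (in %) on 3 tasks recorded after training each task.
acc = np.array([
    [95.0,  0.0,  0.0],
    [90.0, 93.0,  0.0],
    [85.0, 88.0, 92.0],
])
print(average_accuracy(acc, 2))    # (85 + 88 + 92) / 3
print(average_forgetting(acc, 2))  # ((95 - 85) + (93 - 88)) / 2 = 7.5
```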

Comparative experimental analysis

In this section, CL-BPUWM is compared with state-of-the-art methods on the CIFAR-100, CIFAR-10, and Split MNIST datasets. These methods include the regularization methods SI [37], EWC [36], and MAS [38], the knowledge distillation method LwF [45], the generative replay method GR [13], and the BI-R method [51]. First, CL-BPUWM is compared with LwF [45], SI [37], EWC [36], MAS [38], and GR [13]. Then, CL-BPUWM is combined with the BI-R model (BI-R + Our) on the CIFAR-100 dataset, and BI-R + Our is compared with BI-R [51] and with the combination of BI-R and SI (BI-R + SI).

Analysis of experimental results with existing regularization methods

In the class-incremental learning scenario on the CIFAR-100 dataset, the dataset is divided into 5, 10, and 20 tasks for incremental learning, and CL-BPUWM is compared with the generative replay method GR and the regularization methods SI, EWC, and MAS. The incremental learning results on CIFAR-100 are shown in Fig. 3, which reports the average accuracy across tasks for the different methods under each task division. For all three divisions of the complete CIFAR-100 dataset (5, 10, and 20 tasks), CL-BPUWM largely outperformed the regularization methods SI, EWC, and MAS as well as GR in terms of classification accuracy. With 5 tasks, CL-BPUWM was 2.2% higher than MAS and 1.93% higher than GR. With 10 and 20 tasks, CL-BPUWM was 1.57% and 1.11% higher than MAS and 2.15% and 0.70% higher than GR, respectively.

Fig. 3 Test accuracy of the regularization methods and CL-BPUWM on the CIFAR-100 dataset. There are 100 classes in total, divided into 5, 10, and 20 tasks containing 20, 10, and 5 classes each. The figure shows the average accuracy over the 10 benchmark runs

The CIFAR-10 dataset is divided into 5 tasks, each with 2 classes. Compared with the regularization techniques SI, EWC, and MAS, as well as the knowledge distillation method LwF, CL-BPUWM demonstrates superior performance. In Fig. 4, the classification accuracy of CL-BPUWM surpasses SI by 2.24% and GR by 0.7%.

Fig. 4 Test accuracy of the regularization methods and CL-BPUWM on the CIFAR-10 dataset. There are 10 classes in total, divided into 5 tasks, each containing 2 classes. The figure shows the average accuracy over the 10 benchmark runs

On the Split MNIST dataset, we divided the complete MNIST dataset into five tasks of two digits each and compared CL-BPUWM with the regularization methods SI, EWC, and MAS and the knowledge distillation method LwF. As shown in Fig. 5, CL-BPUWM exhibits better classification performance than the existing regularization and knowledge distillation methods.

Fig. 5 Test accuracy of the regularization methods and CL-BPUWM on the Split MNIST dataset. There are 10 classes in total, divided into 5 tasks, each containing 2 classes. The figure shows the average accuracy over the 10 benchmark runs

Combination experiments with replay models under the CIFAR-100 dataset

Combining CL-BPUWM with brain-inspired replay (BI-R + Our), the CIFAR-100 dataset was divided into 5 tasks and 10 tasks. Comparing BI-R + Our with BI-R and BI-R + SI on each task, Tables 4 and 5 report the average accuracy over 10 training runs per task for CL-BPUWM in incremental learning. The results in both tables show that the average classification accuracy over all tasks for the combination of CL-BPUWM and BI-R is 12.48% and 14.38% higher than BI-R alone and 1.4% and 2.35% higher than BI-R + SI, respectively.

Table 4 Accuracy of different methods on 5 tasks in the CIFAR-100 dataset
Table 5 Accuracy of different methods on 10 tasks in the CIFAR-100 dataset

Finally, all replay methods (GR, BI-R, BI-R + SI, BI-R + Our) are compared in the incremental learning setting covering all classes seen so far (i.e., 100 classes). As shown in Fig. 6, the 100 classes are divided into 20, 10, and 5 phases of incremental learning, with 5, 10, and 20 classes learned per phase, respectively. For incremental batches of 5 classes, shown in Fig. 6a, BI-R + Our outperformed BI-R + SI in the 2nd–6th incremental batches, performed similarly to BI-R + SI in batches 7–8, and exceeded BI-R + SI again from the 9th increment onward. Figure 6b shows the results for incremental batches of 10 classes, where the classification accuracy of BI-R + Our is better than that of BI-R + SI. Figure 6c shows incremental learning with batches of 20 classes; BI-R + Our performs similarly to BI-R + SI in the first three incremental batches but significantly outperforms it in the subsequent batches. These incremental learning experiments with three different batch sizes show that BI-R + Our outperforms BI-R + SI in terms of both final incremental accuracy and average incremental accuracy.

Fig. 6 Incremental learning results on CIFAR-100 (accuracy %). a Incremental batches of 5 classes, b incremental batches of 10 classes, and c incremental batches of 20 classes. The horizontal axis represents the total number of classes learned so far, and the vertical axis represents the test accuracy

Combination experiments with replay models under the CIFAR-10 dataset

We also conducted combination experiments on the CIFAR-10 dataset, combining CL-BPUWM with brain-inspired replay (BI-R + Our). CIFAR-10 was divided into five tasks of two classes each, and the per-task accuracy was compared with that of BI-R and BI-R + SI. As with the CIFAR-100 setup, the average accuracy of each task was taken over 10 training runs. The results in Table 6 show that CL-BPUWM combined with BI-R has much higher accuracy than BI-R and BI-R + SI on the first three tasks, though not on the fourth and fifth tasks. In addition, the average classification accuracy of BI-R + Our over all tasks is 12.42% higher than that of BI-R and 5.38% higher than that of BI-R + SI.

Table 6 Accuracy of different methods on 5 tasks in the CIFAR-10 dataset. The table shows the average accuracy of the model test over 10 runs

We also compare all replay methods (GR, BI-R, BI-R + SI, and BI-R + Our) in an incremental learning setup covering all classes seen so far (i.e., 10 classes). Figure 7 shows the results of dividing the 10 classes into 5 incremental phases. After the second increment, the test accuracy of BI-R + Our differs significantly from that of the baseline methods.

Fig. 7 Incremental learning results on CIFAR-10 (accuracy %) with incremental batches of 2 classes. The horizontal axis represents the total number of classes learned so far, and the vertical axis represents the test accuracy

The experimental results show that CL-BPUWM handles class-incremental learning more effectively and robustly. As the number of classes increases, the imbalance between old and new classes, together with the presence of visually similar classes, introduces a strong bias toward new classes and causes visually similar old classes to be misclassified. CL-BPUWM effectively reduces this bias and improves classification accuracy.

Analysis of average forgetting

To compare the effectiveness of mitigating catastrophic forgetting, Table 7 reports the average forgetting results on the CIFAR-100, CIFAR-10, and Split MNIST datasets, calculated as described in the "Evaluation metrics" section. Our method forgets less in the 10-phase and 20-phase settings on CIFAR-100 and also exhibits less forgetting than the baseline methods on CIFAR-10 and Split MNIST. In conclusion, in terms of average forgetting, our method outperforms the knowledge distillation method and several regularization methods.

Table 7 Forgetting measures for continuous learning (lower is better)

Analysis of parametric calculations

Table 8 compares the parameter computation of the proposed method with that of GR and BI-R. The results show that the proposed method requires less parameter computation than the baseline methods in all cases. On the three datasets, the proposed method reduces the amount of parameter computation by 12.68%, 14.81%, and 19.94%, respectively, compared with GR. The reduction is even more pronounced compared with BI-R, at 22.24%, 15.80%, and 24.05%, respectively.

Table 8 Experimental results of calculation amount of algorithm parameters in CIFAR-100, CIFAR-10 and Split MNIST datasets

Ablation study

To further verify the performance of the CL-BPUWM method, we analyzed the effectiveness of each component of the proposed method on the CIFAR-100, CIFAR-10, and Split MNIST datasets. We mainly analyzed the effects of the following aspects on model performance: (1) the influence of parameter changes on model output changes (MOV), (2) Bayesian parameter updates (BPU), (3) dropout regularization, and (4) the cosine annealing algorithm.

Ablation study on the CIFAR-100 dataset

From the results in Table 9, we observe the following: (1) without the Bayesian parameter update strategy (BPU), BAM, or dropout regularization, using only MOV performs poorly. (2) Using MOV together with BPU is more effective than MOV alone, indicating that BPU enables the model to adapt to the current task faster and improves classification accuracy. (3) Dropout successfully alleviates model overfitting and improves classification accuracy. (4) Dynamically adjusting the learning rate with the cosine annealing algorithm also improves classification accuracy to a certain extent.

Table 9 The validity of each component in CL-BPUWM on the CIFAR-100 dataset

Ablation study on the CIFAR-10 dataset

Table 10 shows the ablation experiments on the CIFAR-10 dataset. When only the effect of parameter variations on model output changes (MOV) is considered, the accuracy is only 19.87%, comparable to the performance of MAS. Adding the Bayesian parameter update (BPU) improves accuracy by 0.71%, which also surpasses the existing regularization methods. Accuracy improves further when dropout is added to keep a portion of the model's neurons inactive. Finally, adding the cosine annealing algorithm to dynamically adjust the learning rate gives the algorithm an accuracy of 21.13% on CIFAR-10, a 1.26% improvement over considering only MOV.

Table 10 The validity of each component in CL-BPUWM on the CIFAR-10 dataset

Ablation study on the split MNIST dataset

Table 11 shows that, without the Bayesian parameter update strategy (BPU), BAM, or dropout regularization, considering only the influence of MOV yields an accuracy of only 19.93%, 0.43% lower than with BPU. Combining BPU with MOV improves the classification accuracy, and dropout regularization mitigates model overfitting. Using the cosine annealing algorithm to dynamically adjust the learning rate, the highest accuracy on the Split MNIST dataset was 20.71%, 0.78% higher than with MOV alone.

Table 11 The validity of each component in CL-BPUWM on the split MNIST dataset

By using BPU in class-incremental learning, the model can learn the current task more quickly and effectively. At the same time, the importance of the updated parameters is calculated from the impact of parameter changes on the model output, and the important parameters are kept relatively unchanged, enabling the model to retain knowledge of previous tasks while learning new ones and ultimately alleviating catastrophic forgetting.

Conclusion

We propose a continuous learning method based on Bayesian parameter updating and weight memorization (CL-BPUWM). Specifically, the proposed method predicts the likelihood of unknown events probabilistically from past statistics, which fits well with incremental learning and improves the model's ability to classify the knowledge of a new task. The forgetting of old-task knowledge in continuous learning is alleviated by a regularization method that limits changes to the parameters that are most important to the old tasks while learning new tasks. In addition, dropout regularization is introduced into the model to alleviate overfitting, which further improves classification accuracy. We compared CL-BPUWM with four continuous learning regularization algorithms on the CIFAR-100, CIFAR-10, and Split MNIST datasets to validate the effectiveness of the proposed method, and we validated its feasibility in class-incremental learning after combining CL-BPUWM with the brain-inspired replay model (BI-R + Our). We also conducted extensive ablation experiments on the three datasets to validate the effects of the different components of CL-BPUWM on classification performance. The experimental results show that CL-BPUWM helps the model learn new knowledge more efficiently while preserving knowledge related to old tasks. In future work, we will focus on improving the generative model to produce higher-quality old-task samples and consider integrating our proposed regularization method to further improve model performance and alleviate the forgetting of old knowledge.