Memory Protection Generative Adversarial Network (MPGAN): A Framework to Overcome the Forgetting of GANs Using Parameter Regularization Methods

Generative adversarial networks (GANs) suffer from catastrophic forgetting when learning multiple consecutive tasks. Parameter regularization methods that constrain the parameters of the new model in order to be close to the previous model through parameter importance are effective in overcoming forgetting. Many parameter regularization methods have been tried, but each of them is only suitable for limited types of neural networks. Aimed at GANs, this paper proposes a unified framework called Memory Protection GAN (MPGAN), in which many parametrization methods can be used to overcome forgetting. The proposed framework includes two modules: Protecting Weights in Generator and Controller. In order to incorporate parameter regularization methods into MPGAN, the Protecting Weights in Generator module encapsulates different parameter regularization methods into a “container”, and consolidates the most important parameters in the generator through a parameter regularization method selected from the container. In order to differentiate tasks, the Controller module creates unique tags for the tasks. Another problem with existing parameter regularization methods is their low accuracy in measuring parameter importance. These methods always rely on the first derivative of the output function, and ignore the second derivative. To assess parameter importance more accurately, a new parameter regularization method called Second Derivative Preserver (SDP), which takes advantage of the second derivative of the output function, is designed into MPGAN. Experiments demonstrate that MPGAN is applicable to multiple parameter regularization methods and that SDP achieves high accuracy in parameter importance.


I. INTRODUCTION
Generative adversarial networks (GANs) [1] have led to significant improvements in image generation [1], [2], image deblurring [3], super-resolution [4], [5], and other domains. Unfortunately, when trained on multiple tasks arriving sequentially, GANs tend to forget previously learned knowledge, a phenomenon known as catastrophic forgetting [6]. Catastrophic forgetting severely restricts the application of GANs in real-world scenarios [7]-[9]. For example, it is impractical for hospitals to retain patient data permanently due to privacy regulations or high storage costs. Under catastrophic forgetting, GANs face the challenge of reusing knowledge learned from old patient data without storing that data. (The associate editor coordinating the review of this manuscript and approving it for publication was Pengcheng Liu.)
Many methods have been proposed to alleviate catastrophic forgetting. The mainstream methods for overcoming forgetting in convolutional neural networks (CNNs) [8]-[19] can generally be divided into three categories: transfer learning approaches, rehearsal mechanisms, and parameter regularization methods. Transfer learning approaches [8]-[10] attempt to solve the problem of catastrophic forgetting by relaying previously learned information to the current model. The trouble with transfer learning approaches is that they generally require preserving all previously learned model parameters and therefore do not scale to a large number of tasks. In rehearsal mechanisms [11]-[13], past knowledge is regularly replayed alongside real samples so that the network parameters are jointly optimized for the new task. The problem with rehearsal mechanisms is that jointly training on past and new data is both time-consuming and memory-intensive, especially when the number of tasks is large. Parameter regularization methods [14]-[19] restrict updates to weights of the new task according to how important those weights were to previous tasks, keeping the new model close to the previous one. Compared with transfer learning approaches and rehearsal mechanisms, parameter regularization methods do not have to preserve a past model, retrain on old data, or add an explicit memory, which makes them cost-effective in real-world environments. For this reason, this work adopts parameter regularization strategies to solve the problem of catastrophic forgetting in GANs.
Although several parameter regularization methods, such as Elastic Weight Consolidation (EWC) [14] and Memory Aware Synapses (MAS) [15], have significantly counteracted forgetting in certain CNNs, no unified framework exists that can overcome forgetting with a variety of different parameter regularization methods. To bridge this gap for GANs, this work proposes the Memory Protection GAN (MPGAN). The advantage of MPGAN is that any strategy belonging to the family of parameter regularization methods can be used within the framework to overcome forgetting in GANs. To construct MPGAN, two issues must be resolved. The first is that although tasks arrive continuously one after another, they should be treated as discrete; thus, how to differentiate sequentially arriving tasks must be addressed first. The second is how to extract the commonality of, and preserve the differences between, different parameter regularization methods when merging them into one framework. To address these two issues, the Controller module and the Protecting Weights in Generator (PWG) module are developed. By organically combining the two modules with traditional GANs, the proposed MPGAN framework is created, which can utilize various parameter regularization methods to address forgetting in GANs.
Different parameter regularization methods [14], [15] calculate parameter importance in different ways. Traditional parameter regularization methods may underestimate parameter importance because they rely only on the first derivative of the output function. The first derivative approaches zero as CNNs converge, but the second derivative can remain large even when the first derivative is minimal. Therefore, traditional parameter regularization methods that use only the first derivative of the output function would underestimate the importance of some parameters. To compute parameter importance more accurately, this work creates a parameter regularization approach called the Second Derivative Preserver (SDP), which adds a second-derivative term to the original first derivative of the output function. SDP is also incorporated into MPGAN and achieves satisfactory performance in mitigating forgetting in GANs.
Specific contributions of this work include: 1) This work proposes a unified framework, called MPGAN, which consists of a traditional GAN model, a Controller module, and a PWG module to address the problem of catastrophic forgetting in GANs. The framework can use different parameter regularization methods.
2) This work proposes a more accurate parameter regularization method, called SDP, which takes advantage of the second derivative as an amendment to the first derivative of the output function. SDP is applied in MPGAN.

II. RELATED WORKS
There are three categories of methods of investigating catastrophic forgetting: transfer learning approaches, rehearsal mechanisms, and parameter regularization methods.
Transfer learning approaches [8]- [10] mitigate catastrophic forgetting by reusing previously acquired knowledge in the current model. Progressive neural networks [8] are immune to forgetting through their ability to leverage prior knowledge via lateral connections to previously learned features. Deep block-modular neural networks [9] explore a block-modular architecture for CNNs, which allows parts of the existing network to be re-used for the purpose of solving a new task without a decrease in performance when solving the original task. Transfer learning has been applied in GANs. For example, Lifelong GAN [10] employs knowledge distillation to transfer learned knowledge from previous networks to the new network. Lifelong GAN distills information from the old model to the new model by encouraging the two networks to produce similar output values or patterns given the auxiliary data as inputs. The major drawback of transfer learning is that an external working memory is needed to preserve all of the previously learned model parameters.
In rehearsal mechanisms [11]-[13], data from old tasks are sampled from generated pseudodata and interleaved with the data of new tasks for joint training, in order to consolidate the old knowledge. For example, many rehearsal strategies are designed to replay previous memories in CNNs, including choosing recent memories, and choosing memories most similar, or least similar, to new ones [20]. GANs are a type of generative model, and the generator plays an active role in generating images; as a result, rehearsal mechanisms have been widely used to overcome GANs' forgetting. For example, in a Deep Generative Replay network [11], the generator is paired with a solver to allow sequential learning on multiple tasks by generating and rehearsing fake data that mimics former training examples. Memory Replay GANs (MeRGANs) [12] propose two models, joint training with replay and replay alignment, to prevent forgetting by leveraging replays. Besides GANs, rehearsal mechanisms also successfully address forgetting in other generative models such as the Variational Autoencoder (VAE) [21]. In lifelong generative modelling [13], a student-teacher architecture is introduced to learn and preserve all the distributions seen so far to mitigate forgetting in VAEs. Most rehearsal strategies require regenerating samples from previous categories and jointly retraining new samples with these additional replayed samples. However, there are two shortcomings when employing the rehearsal strategy to overcome GANs' catastrophic forgetting in real-world environments [11], [12]. One is that regenerating old pseudodata from old models is highly time-consuming. The other is that training old data from the previous task together with new data is even more time-consuming and also uses a great deal of storage space.
Parameter regularization methods [14], [15] also show good performance in overcoming forgetting in CNNs. They attempt to reduce representational overlap among tasks through regularization, such as freezing the weights of a convolutional layer or weight consolidation. The classic parameter regularization method EWC [14] estimates how essential each parameter is at the conclusion of a task through the diagonal precision provided by the diagonal of the Fisher information matrix [22]. It then prevents the model from straying too far from the parameters that are important for the task. Similar to EWC [14], MAS [15] also computes the importance of the parameters and penalizes changes to important parameters, but it calculates importance based on how sensitive the predicted output function is to a change in each parameter. The intelligent synapses (IS) model [19] protects the parameters according to their importance along the task's entire training trajectory. In IS, each parameter of the CNN is awarded an importance measure based on how much it reduces the loss while learning tasks. These parameter regularization methods have been utilized in GANs. For example, the Continual Learning GAN [20] employs EWC to protect parameters in the generator critical for previous tasks from being replaced by those of later tasks. EWC-GAN and IS-GAN [21] each protect important parameters in the discriminator and achieve satisfactory results on both forgetting and mode collapse. Similarly, the self-supervised GAN [22] focuses on the discriminator, where class information plays an important role in identifying real images; it encourages the discriminator to maintain useful representations of class information by adding a rotation-based loss to the original loss function of the discriminator.
Although these reconstructed GANs can minimize forgetting, they only focus on one parameter regularization method, which suggests that other parameter regularization methods may fail when applied to these models. This paper attempts to create a framework that can utilize various parameter regularization methods to prevent forgetting in GANs.

III. PRELIMINARY KNOWLEDGE
In this section, preliminary knowledge about vanilla GANs and traditional parameter regularization methods is presented.

A. VANILLA GANs
Two neural networks, a generator G and a discriminator D, are trained simultaneously in the traditional GAN framework, as shown in Figure 1. For image generation, G captures the data distribution and generates a fake image as output. D estimates the probability that a sample came from the training data (a true image) rather than from G. The training procedure for G is to maximize the probability of D making a mistake [1]. In other words, D and G play the following two-player minimax game with loss function (1):

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))] \quad (1)

where x represents the true data, P_data(x) is the distribution of x, z is the input noise variable, P_z(z) is a prior on z, and D(x) represents the probability that x came from the data rather than from G.
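As an illustration, the two losses implied by the minimax objective in (1) can be sketched as follows. The helper `gan_losses` is hypothetical (not part of the original paper) and operates on a single pair of discriminator outputs:

```python
import math

def gan_losses(d_real, d_fake):
    """Losses implied by the minimax game in (1), for one pair of
    discriminator outputs (probabilities in (0, 1)).

    d_real: D(x) on a real sample; d_fake: D(G(z)) on a generated sample.
    Returns (d_loss, g_loss), both to be minimized by gradient descent.
    """
    # D maximizes log D(x) + log(1 - D(G(z))) -> minimize the negation.
    d_loss = -(math.log(d_real) + math.log(1.0 - d_fake))
    # G minimizes log(1 - D(G(z))) in the strict minimax formulation.
    g_loss = math.log(1.0 - d_fake)
    return d_loss, g_loss
```

In practice many GAN implementations replace the generator term with the "non-saturating" variant, but the sketch above follows equation (1) directly.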

B. TRADITIONAL PARAMETER REGULARIZATION METHODS
Classic parameter regularization methods such as EWC [14] and MAS [15] have been shown to be effective in calculating parameter importance. The EWC regularization method assumes that parameters associated with higher Fisher diagonal elements are more important [14], so it uses the Fisher information matrix to calculate parameter importance in CNNs as shown in (2):

I_{EWC}(\theta) = \mathbb{E}\left[\left(\frac{\partial F(\theta)}{\partial \theta}\right)^{2}\right] \quad (2)

where I_EWC(θ) is the parameter importance calculated by EWC, F is the output function, and θ is the set of all parameters in the CNN. When computing the parameter importance in the generator, this can be expressed as (3):

I_{EWC}(\theta) = \mathbb{E}_{z}\left[\left(\frac{\partial G(z;\theta)}{\partial \theta}\right)^{2}\right] \quad (3)

where z is a random noise vector and G is the generator. MAS [15] measures parameter importance by the gradient of the squared l2 norm of the output function, as shown in (4):

I_{MAS}(\theta) = \mathbb{E}\left[\left\|\frac{\partial \|F(\theta)\|_{2}^{2}}{\partial \theta}\right\|\right] \quad (4)

where I_MAS(θ) is the parameter importance calculated by MAS, F is the output function, and ‖F(θ)‖₂² is the squared l2 norm of the output function. Accordingly, when applying MAS to the generator, the importance can be estimated through (5):

I_{MAS}(\theta) = \mathbb{E}_{z}\left[\left\|\frac{\partial \|G(z;\theta)\|_{2}^{2}}{\partial \theta}\right\|\right] \quad (5)

where z is a random noise vector and G is the generator.
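A minimal sketch of the two importance estimates above, using central finite differences in place of autograd so that it runs on any scalar output function. The helper names (`numerical_grad`, `ewc_importance`, `mas_importance`) are hypothetical, not part of the original paper:

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-5):
    """Central-difference gradient of a scalar function f at parameter vector theta."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        grad[k] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

def ewc_importance(output_fn, theta, inputs):
    """EWC-style importance (eq. (2)): mean squared gradient of the output
    over a set of inputs, i.e. an empirical Fisher diagonal."""
    grads = np.stack([numerical_grad(lambda t: output_fn(t, x), theta) for x in inputs])
    return (grads ** 2).mean(axis=0)

def mas_importance(output_fn, theta, inputs):
    """MAS-style importance (eq. (4)): mean absolute gradient of the squared
    output (the squared l2 norm, for a scalar output) over a set of inputs."""
    grads = np.stack([numerical_grad(lambda t: output_fn(t, x) ** 2, theta) for x in inputs])
    return np.abs(grads).mean(axis=0)
```

For a toy linear output F(θ, x) = θ·x, the EWC importance of parameter k reduces to the mean of x_k² over the inputs, matching the intuition that frequently activated parameters are more important.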

IV. MPGAN FRAMEWORK AND SDP
In this section, on the basis of the above preliminary knowledge, the proposed MPGAN is described in greater detail: the cause of forgetting in GANs is first analyzed, strategies to address it are then introduced, and finally the SDP method is discussed.

A. CAUSE OF FORGETTING IN TRADITIONAL GANS
As mentioned above, the generator generates images and the discriminator provides the optimization information that guides the generator through gradient descent. If the parameters in the generator are updated entirely according to standard gradient descent when multiple tasks arrive sequentially, the parameters learned on previous tasks are erased and performance on those tasks drops abruptly, which leads to catastrophic forgetting [6], [23]. Consider two tasks arriving consecutively: the first task (T1) is to generate dogs; the second task (T2) is to generate cats. The generator and discriminator of a traditional GAN have already been trained to generate dogs (T1). When T2 arrives, the discriminator aims to distinguish true cat images from fake ones and therefore guides the generator to update its parameters to cater to T2. If the generator is absolutely obedient to the discriminator's gradient descent direction, it loses the information of the old task (T1). Thus, one way to mitigate forgetting is to prevent parameters critical for previously learned tasks from undergoing drastic changes in value. Based on this idea, the MPGAN framework is created.

B. MPGAN FRAMEWORK
In this part, the structure, operation process and principle of the proposed MPGAN are examined. Two issues must be addressed to solve catastrophic forgetting in GANs. The first is how to differentiate tasks arriving sequentially and continuously, which is a problem all neural networks face when attempting to overcome forgetting. The second, as analyzed in Section IV A, is how to prevent previously learned important information in the generator from being overwritten, through a series of different parameter regularization methods. To solve the two issues, the Controller and the Protecting Weights in Generator (PWG) modules are introduced and MPGAN is created. The framework of MPGAN is shown in Figure 2.

1) CONTROLLER
Tasks arriving consecutively should be treated as discrete and independent, so the Controller is proposed to differentiate them. Specifically, the Controller receives a task as input and creates a unique tag for this task, as shown in (6):

\mathrm{tag}_i = \mathrm{Controller}(T_i) \quad (6)

where T_i is the i-th task and tag_i is the tag that the Controller creates for T_i. After the tag is created, it is combined with the data of the task and sent into the GAN for training. The tags can have varying modalities. In the simplest case, they can be one-hot encodings or feature maps; more generally, they can be structured objects, such as a paragraph of natural language explaining how to solve the task. The tags are important because they are the only identity of a past task when the task's data is inaccessible due to storage costs or privacy. The old task related to a tag is evoked when MPGAN receives a tag it has encountered before. The workflow of the Controller is shown in Figure 3.
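In the simplest one-hot case described above, the Controller's behaviour can be sketched as follows; `make_tag` and `condition_noise` are hypothetical helper names, not part of the original paper:

```python
import numpy as np

def make_tag(task_index, num_tasks):
    """Controller in its simplest form (eq. (6)): map task T_i
    to a unique one-hot tag."""
    tag = np.zeros(num_tasks, dtype=np.float32)
    tag[task_index] = 1.0
    return tag

def condition_noise(z, tag):
    """Combine the task tag with the generator's noise input by concatenation,
    so that presenting the tag later can evoke the task-specific memory."""
    return np.concatenate([z, tag])
```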

2) PWG
Per the analysis in Section IV A, if the parameters in the generator are protected from being completely changed by the new ones, the problem of forgetting in GANs may be alleviated. PWG is proposed to protect learned critical parameters in the generator through various parameter regularization methods.
Utilizing the idea of an encapsulation service, PWG first establishes a container that encapsulates many different parameter regularization methods, including EWC, MAS and others. Then, when the training of T_i is finished, PWG assesses the parameter importance of T_i through a selected parameter regularization method. Different parameter regularization methods have their own ways of computing parameter importance. Note that no matter which method is chosen, the importance value is set to zero if it is negative, in order to reduce the resistance to learning new tasks, as shown in (7):

PWG(\theta_i) = \max(I_i, 0) \quad (7)

where θ_i represents the parameters of T_i in the generator, I_i is the parameter importance of θ_i calculated by the parameter regularization methods, and PWG(θ_i) is the final parameter importance after the negative values are set to zero for T_i. Finally, PWG constrains the drastic movement of important parameters while allowing unimportant parameters to be updated for T_{i+1}, according to their importance. PWG achieves this by adding a penalty term to the original loss function of task T_{i+1}, as shown in (8):

L^{new}_{i+1} = L_{i+1} + \lambda \sum_{k} PWG(\theta_{i,k}) \left( \theta_{i+1,k} - \theta_{i,k} \right)^{2} \quad (8)

where θ_{i+1} are the parameters of T_{i+1} in the generator, λ represents how important the previous task is relative to the latter task, L_{i+1} is the original GAN loss function of T_{i+1}, and L^{new}_{i+1} is the new MPGAN loss function of T_{i+1}.
Reasonably, parameters with large importance values are salient in preserving the information of the learned task, and a large perturbation in their values would cause a drastic decrease in performance; therefore, large changes to important parameters are penalized heavily.
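The clamping in (7) and the penalized loss in (8) can be sketched as follows; `pwg_importance` and `pwg_loss` are hypothetical helper names, not part of the original paper:

```python
import numpy as np

def pwg_importance(raw_importance):
    """Eq. (7): negative importance values are clamped to zero so they do not
    resist learning the new task."""
    return np.maximum(raw_importance, 0.0)

def pwg_loss(base_loss, theta_new, theta_old, raw_importance, lam):
    """Eq. (8): the original GAN loss of task T_{i+1} plus a quadratic penalty
    that punishes drift of parameters important to task T_i."""
    importance = pwg_importance(raw_importance)
    penalty = np.sum(importance * (theta_new - theta_old) ** 2)
    return base_loss + lam * penalty
```

Note that parameters with zero (or clamped negative) importance contribute nothing to the penalty and remain free to move for the new task.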

C. SDP METHOD
The accuracy of the parameter-importance calculation in PWG affects how well MPGAN overcomes forgetting.
Although both EWC and MAS, shown in Section III B, can estimate parameter importance, the accuracy of that estimate may not be high. Specifically, EWC and MAS rely only on the first derivative of the output function to approximate the true parameter importance. However, the second derivative may still be high even when the first derivative of the output function is small; thus EWC and MAS, which depend only on the first derivative, can substantially underestimate the importance of certain parameters. This work therefore creates a new parameter importance calculation method, called SDP. Unlike EWC and MAS, SDP estimates parameter importance by adding the second derivative of the output function to the original first derivative.
In SDP, given a well-trained model, parameter importance is measured by how much the performance of the model degrades when a parameter θ_k is removed, as shown in (9):

\delta F = \left| F(\theta) - F(\theta \mid \theta_k = 0) \right| \quad (9)

where F is the output function of the model, δF represents how much the loss of parameter θ_k degrades performance, and θ is the set of all parameters in the model.
Parameter θ_k is more important when its δF is larger. Calculating the importance of each parameter in a CNN one by one is tedious and sometimes computationally complex. As an alternative, δF in (9) can be expressed via a Taylor expansion over all parameters together, as shown in (10):

\delta F \approx \left( \frac{\partial F}{\partial \theta} \right)^{\top} \delta\theta + \frac{1}{2} \, \delta\theta^{\top} \frac{\partial^2 F}{\partial \theta^2} \, \delta\theta \quad (10)

where ∂F/∂θ represents the gradient with respect to the parameters, and ∂²F/∂θ² is the Hessian matrix. The gradient approaches zero as CNNs converge. As a result, the first-derivative term of δF becomes too small to obtain a precise calculation of parameter importance, and the second-derivative term is employed as an amendment. Hence, parameter importance in SDP is calculated by (11):

I_{SDP}(\theta) = \left| \frac{\partial F}{\partial \theta} \right| + \frac{1}{2} \left| \frac{\partial^2 F}{\partial \theta^2} \right| \quad (11)

Fisher information is closely related to the curvature of the log-likelihood function, as measured by its Hessian matrix [23]. Due to the complexity of calculating the Hessian matrix, in practice the Fisher information is introduced to approximate it, giving (12):

I_{SDP}(\theta) = \mathbb{E}\left[ \left| \frac{\partial F}{\partial \theta} \right| \right] + \frac{1}{2} \, \mathbb{E}\left[ \left( \frac{\partial F}{\partial \theta} \right)^{2} \right] \quad (12)

where E[(∂F/∂θ)²] represents the Fisher information.
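Given per-sample gradients of the output function, the Fisher-approximated SDP importance can be sketched in a few lines; `sdp_importance` is a hypothetical helper name, and the ½ coefficient follows the Taylor expansion above:

```python
import numpy as np

def sdp_importance(per_sample_grads):
    """SDP importance with the Fisher approximation (eqs. (11)-(12)):
    mean absolute first derivative plus half the mean squared gradient,
    the latter standing in for the Hessian diagonal.

    per_sample_grads: array of shape (num_samples, num_params), each row
    holding the gradient of the output function for one sample."""
    first_term = np.abs(per_sample_grads).mean(axis=0)    # E[|dF/dtheta|]
    fisher_term = (per_sample_grads ** 2).mean(axis=0)    # E[(dF/dtheta)^2]
    return first_term + 0.5 * fisher_term
```

Because the Fisher term squares the gradient, a parameter whose per-sample gradients are large but cancel in sign still receives high importance, which is exactly the case the first-derivative-only methods underestimate.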

D. THE WORKFLOW OF MPGAN
In this section, the workflow of MPGAN is summarized. On the first task, MPGAN is trained and tested as the vanilla GAN is, except for the task tag supplied to the generator. The importance of parameters in the generator is evaluated according to the chosen regularization method. On further tasks, MPGAN is trained with different task tags and with the regularization of parameter importance with respect to the previous tasks. After each new stage, the performance on the tasks up to this stage can be tested by supplying the correct tags.

Algorithm 1 Training the Model at the i-Th Task
Input: the data of the new task T_i. Require: the model of the previous task T_{i-1}.
1: if i = 0 then
2:   Create a tag for T_i according to equation (6)
3:   Train MPGAN to perform T_i according to equation (1)
4: else
5:   Create a tag for T_i according to equation (6)
6:   Select a parameter regularization method from EWC, MAS or SDP
7:   Calculate the parameter importance of T_{i-1} according to equation (3), (5) or (12)
8:   Update MPGAN to perform T_i according to equation (8)
9: end if

The detailed training process of MPGAN is shown in Algorithm 1, using image generation as an example. Sequential multi-task image generation can be expressed as {T_0, ..., T_i}. For T_{i-1}, the Controller creates a tag tag_{i-1} for it; then T_{i-1} is sent to MPGAN for training. When T_i arrives, the Controller creates a tag tag_i for it, and PWG begins its operation: it first chooses a regularization method to calculate the parameter importance I_{i-1} of the generator. When training T_i, PWG protects the parameters in the generator, retaining their old values in proportion to I_{i-1}.
For testing, if previous {T 1 , . . . , T i−1 } are required to be performed again, the controller sends their tags to MPGAN to evoke its corresponding memories.
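The control flow of Algorithm 1 can be sketched as a short driver loop. The callables `train_gan`, `compute_importance` and `train_with_pwg` are assumed placeholders supplied by the user (vanilla training per eq. (1), importance per eq. (3)/(5)/(12), and the regularized update per eq. (8)); none of them are part of the original paper:

```python
def train_sequential(tasks, train_gan, compute_importance, train_with_pwg):
    """Sketch of Algorithm 1: train MPGAN on tasks arriving one at a time.

    The first task is trained as a vanilla GAN; every later task is trained
    with the PWG-regularized loss, using the importance computed after the
    previous task."""
    model, old_params, importance = None, None, None
    for i, task in enumerate(tasks):
        tag = i  # Controller: a unique tag per task (eq. (6))
        if i == 0:
            model = train_gan(task, tag)
        else:
            model = train_with_pwg(model, task, tag, old_params, importance)
        old_params = model["params"]
        importance = compute_importance(model)
    return model
```

At test time, replaying a stored tag through the trained generator is what evokes the corresponding task's memory.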

A. DETAILS OF EXPERIMENTS
GAN models used in MPGAN: Three different GAN models were selected to demonstrate the effectiveness of MPGAN. These GAN models can all generate high-quality images, but they suffer from catastrophic forgetting when it comes to continual image generation. The three GAN models were therefore used to check the effectiveness of the MPGAN framework in overcoming catastrophic forgetting.
• WGAN-GP [24]: It is known for its ability to stabilize training.
• InfoGAN [25]: It can disentangle representation which is helpful for learning salient features of data.
• SAGAN [26]: It can generate high-resolution details by using cues from all feature locations.
Baseline: MPGAN was compared with four baselines.
• Joint learning (JL) [27]: all tasks are trained jointly in a non-sequential setting. JL is the upper bound because no forgetting occurs in it.
• Stochastic Gradient Descent (SGD) [27]: tasks are learned in a sequential setting without any strategy to prevent forgetting. SGD is the lower bound.
• Deep Generative Replay network (DGR) [11]: Images generated by a generator trained on previous tasks are combined with the training images for the current task to form a hybrid training set. This belongs to the rehearsal mechanisms.
• Lifelong GAN (LLG) [10]: It prevents forgetting by utilizing knowledge distillation to transfer learned knowledge from previous networks to the new network. This belongs to the transfer learning methods.
Dataset: Four image generation datasets were selected to verify the effectiveness of MPGAN. The usage of the four datasets is shown in Table 1.
• MNIST: It contains handwritten digits 0 to 9.
• SVHN [29]: It contains digits 0 to 9 obtained from house numbers in Google Street View images.
• CelebA [30]: It is a large-scale facial attributes dataset with more than 200,000 images of celebrities.
• Anime-Face-Dataset: It is a collection of high-quality anime faces.
In the following experiments, MNIST and SVHN are each divided into two parts to form the datasets of two continual tasks.
Evaluation: The performance of MPGAN in overcoming forgetting was measured through four metrics.
• Average classification accuracy (ACC) [31]: ACC pertains to the accuracy of the classifier network trained on real images and evaluated on generated images. In our experiments, ACC is used to measure the quality of samples in previous tasks. Intuitively, if forgetting occurs, MPGAN would not regenerate previous samples clearly and properly, and the ACC would be low, and vice versa. Thus, the ACC can be used as a metric to measure the forgetting of MPGAN. Higher ACC indicates better quality of samples in previous tasks and less forgetting.
• Frechet Inception Distance (FID) [32]: FID is a commonly used metric for evaluating generative models. It compares the statistical properties of generated samples to real samples. A lower FID score indicates more realistic images that match the statistical properties of real images [33]. FID is sensitive to both quality and diversity.
• Backward Transfer (BWT) [31]: BWT measures the influence that learning a latter task has on the performance on a former task. It describes how much the new task weakens the previous tasks.
• Computational overhead: It includes training time and memory size, which measure how much time and memory the different models require during training.
MPGAN overcomes forgetting more effectively when ACC and BWT are higher and FID and computational overhead are lower.
The formulas for ACC, BWT and FID are (13), (14), and (15), respectively:

\mathrm{ACC} = \frac{1}{T} \sum_{i=1}^{T} R_{T,i} \quad (13)

\mathrm{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( R_{T,i} - R_{i,i} \right) \quad (14)

where T is the number of tasks, and R_{i,j} is the test classification accuracy of the model on task t_j after observing the last sample from task t_i.

\mathrm{FID} = \|\mu_r - \mu_g\|^{2} + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \right) \quad (15)

where the statistics µ_r, µ_g and Σ_r, Σ_g are obtained from the activations of the InceptionV3 network into which the real and generated samples are fed; they represent the means and covariance matrices of the real and generated samples in the feature space, respectively. In the following experiments, the best values among the different models on each metric are rendered in bold, and the best values among the MPGAN models are underlined. Note that no forgetting occurs in JL, so its ACC and FID are always the best. In addition, BWT is undefined for JL, which is denoted by 'NA'. No anti-forgetting strategy is applied in SGD, so its computational overhead is always the minimum.
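The metric definitions above can be sketched directly. `acc_bwt` follows (13)-(14) on an accuracy matrix R; `fid_univariate` is the one-dimensional special case of (15), where the matrix square root reduces to a scalar square root (the full FID needs a matrix square root of Σ_rΣ_g). Both helper names are hypothetical:

```python
import numpy as np

def acc_bwt(R):
    """Eqs. (13)-(14): average accuracy and backward transfer from an accuracy
    matrix R, where R[i, j] is accuracy on task j after training through task i."""
    T = R.shape[0]
    acc = R[T - 1].mean()
    bwt = np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)])
    return acc, bwt

def fid_univariate(mu_r, var_r, mu_g, var_g):
    """Eq. (15) specialized to a single feature dimension."""
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * np.sqrt(var_r * var_g)
```

For example, with two tasks, a negative BWT means the final model is worse on the first task than it was right after learning it, i.e. forgetting occurred.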

B. RESULTS OF OVERCOMING CATASTROPHIC FORGETTING
In this section, the effectiveness of MPGAN on overcoming the GANs' catastrophic forgetting in image generation is evaluated as a preliminary experiment.
WGAN-GP was employed as the GAN model in the MPGAN framework, and MPGAN was called MPGAN WGAN-GP in this experiment. MPGAN WGAN-GP could be specified as MPGAN EWC WGAN-GP, MPGAN MAS WGAN-GP, or MPGAN SDP WGAN-GP when EWC, MAS, or SDP was chosen as the parameter regularization method, respectively. CelebA and the Anime-Face-Dataset were the test datasets. MPGAN WGAN-GP was trained to generate celebrity faces as the first task; next, it was trained to generate anime faces as the second task.
To determine whether MPGAN WGAN-GP can counter forgetting, whether it can still generate the celebrity faces of the first task while training the second task is checked. Intuitively, if forgetting occurs, the celebrity faces of the first task would not be generated and would quickly vanish within a few training epochs of the second task. If MPGAN WGAN-GP overcomes forgetting, the celebrity faces would still be regenerated throughout the training of the second task. The quantitative results are reported in Tables 2 and 3. The ACC of regenerated celebrity faces in all MPGAN WGAN-GP models ranged from 65% to 92%, whereas the ACC of SGD declined sharply to only 1.73% and 0.05% at the 10th and 30th epochs, respectively. Similarly, the marked contrast in FID between all of the MPGAN WGAN-GP models and SGD illustrates that the MPGAN WGAN-GP models could largely maintain the original data distributions of previously learned samples.
In conclusion, the regenerated samples of the first task indicate that MPGAN WGAN-GP can, to a certain extent, hinder important old parameters from drifting to new values. In addition, MPGAN SDP WGAN-GP memorizes the previous samples best, owing to its most accurate calculation of parameter importance among all of the MPGAN WGAN-GP models.

C. CONTINUAL IMAGE GENERATION RESULTS
If catastrophic forgetting is prevented, MPGAN would not completely forget previous knowledge; therefore, MPGAN may learn a continual sequence of tasks. In this section, InfoGAN and SAGAN were employed as the GAN models in the MPGAN framework, denoted MPGAN InfoGAN and MPGAN SAGAN, respectively. The experimental process was as follows. Each dataset was divided into two parts: one to generate digits 0, 1, 2, 3, and 4 as the first task, and one to generate digits 5, 6, 7, 8, and 9 as the second task. MPGAN trained the two sequentially arriving tasks as shown in Algorithm 1. Regarding testing, the tags of the two tasks were fed into MPGAN to check whether the samples of both the old and the new task could be regenerated. If the regenerated samples for both tasks were satisfactory, MPGAN's ability to learn continual image generation tasks could be validated.
The regenerated results of MNIST and SVHN are shown in Figure 5 and Figure 6, respectively. The ten blocks in Figure 5 and Figure 6 represent the regenerated digits 0 through 9, respectively. In each block, the first column shows the results of JL, and the remaining columns show the results of the other models. The results show that the previously learned digits 0 to 4 generated by SGD are hardly recognizable because SGD completely forgets acquired knowledge. In contrast, as shown in Figure 5 and Figure 6, all MPGAN InfoGAN and MPGAN SAGAN models generate proper and clear digits for the first task, which demonstrates that MPGAN does not forget acquired digits. In addition, the satisfactorily regenerated digits 5 to 9 demonstrate that MPGAN could correctly perform the second image generation task again. The quantitative results are reported in Table 4. In addition, the computational overhead, including the training time per epoch in the second task and the parameters of the different methods, was measured. JL consumes a large amount of memory because it requires all data to be available and trained together, which is impractical in real scenarios. DGR also occupies considerable memory and time because it regenerates portions of previous data and trains them together with new data. LLG does not need as much memory as JL and DGR, but its distillation loss requires preserving old data as auxiliary data, and the distillation loss itself has to store the old output results. Compared with JL, rehearsal mechanisms, and transfer learning approaches, MPGAN, which uses parameter regularization methods, consumes only half to a quarter of the time and memory. The most memory- and time-consuming part lies in the calculation of parameter importance and its storage.
Moreover, the value of λ represents how important the former task is relative to the latter task, so λ influences how strongly forgetting is hindered. Different values of λ were tested for the MPGAN models on the first task, and red boxes mark the best values of λ in Figure 7. The best value of λ is the one that best assists MPGAN in retaining the first task. Figure 7 also indicates that small values of λ cause the models to deviate rapidly from their previous-task values once training on the next task begins, whereas large values of λ interfere with the new task without any obvious improvement in image quality.
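The role of λ can be seen in the generic form of a parameter-regularization objective. The sketch below is illustrative: the exact penalty differs per method, and `regularized_loss` with its importance weights Ω is a hypothetical helper, not the paper's implementation.

```python
def regularized_loss(task_loss, params, old_params, importance, lam):
    """Generic parameter-regularization objective:
    L = L_task + lam * sum_i Omega_i * (theta_i - theta_i_old)^2.
    A larger lam anchors the model more strongly to the old task;
    a smaller lam lets the new task pull parameters away faster."""
    penalty = sum(w * (p - p_old) ** 2
                  for p, p_old, w in zip(params, old_params, importance))
    return task_loss + lam * penalty

# Two toy parameters: the first changed and is important, the second did not move.
loss = regularized_loss(1.0, [1.0, 2.0], [0.0, 2.0], [0.5, 1.0], lam=2.0)
```

With λ = 0 the penalty vanishes (rapid forgetting); with a very large λ the penalty dominates the task loss, which matches the interference with new tasks observed in Figure 7.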
In summary, the qualitative and quantitative experimental results of MPGAN InfoGAN and MPGAN SAGAN on MNIST and SVHN illustrate that MPGAN can recollect past knowledge and achieve continual image generation. However, the overall performance of MPGAN InfoGAN is better than that of MPGAN SAGAN. Specifically, the ACC on the first task of MPGAN InfoGAN is 8%-10% higher than that of MPGAN SAGAN, the FID of MPGAN InfoGAN is 11-20 lower, and, judging from the BWT values, MPGAN InfoGAN forgets 0.2%-1.6% less past knowledge than MPGAN SAGAN. The main reason for the inferior performance on SVHN is that SVHN images are more complicated than MNIST images in color, texture and background, so preserving the detailed information is more difficult.
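ACC and BWT can be computed from a task-performance matrix. The sketch below uses the common continual-learning definitions (final average performance, and average backward transfer onto earlier tasks); the paper's exact formulas may differ, and the toy matrix `R` is hypothetical.

```python
def acc_and_bwt(R):
    """R[i][j]: performance on task j after training on task i (0-indexed).
    ACC: mean performance over all tasks after the last task is trained.
    BWT: mean change on earlier tasks after training the last one
    (negative values indicate forgetting)."""
    T = len(R)
    acc = sum(R[T - 1][j] for j in range(T)) / T
    bwt = sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)
    return acc, bwt

# Toy two-task matrix: task 1 drops from 0.95 to 0.90 after task 2.
R = [[0.95, 0.10],
     [0.90, 0.93]]
acc, bwt = acc_and_bwt(R)
```

Here a BWT of -0.05 would mean the model lost five points of first-task performance while learning the second task.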
In this experiment, the successful regeneration of former and latter tasks demonstrates that MPGAN is capable of learning new knowledge while remembering old information, which means that MPGAN can achieve continual image generation on MNIST and SVHN. It also indicates that SDP can assist MPGAN in producing samples with higher visual fidelity.

VI. DISCUSSION
This paper introduces MPGAN, which combines GANs with the Controller and PWG modules to address catastrophic forgetting in GANs through parameter regularization methods. The experiments first demonstrate that MPGAN is applicable to many different parameter regularization methods, such as EWC, MAS and the proposed SDP. The regenerated samples of previous tasks are clear and diverse: their ACCs are consistently above 80%, their FIDs are all below 81, and their BWTs fall roughly between -3 and 0. The experimental results also indicate that SDP assesses parameter importance more accurately and thus overcomes forgetting more effectively. MPGAN augmented with SDP consistently outperforms MPGAN with EWC or MAS, improving by approximately 2%-5% on ACC, 3%-13% on FID and 31%-74% on BWT, and it even outperforms some state-of-the-art methods, e.g., DGR and LLG. In addition, MPGAN with parameter regularization methods requires less time and memory than rehearsal mechanisms such as DGR and transfer learning methods such as LLG.
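To illustrate why the second derivative adds information that gradient-only importance measures miss, the toy score below combines the magnitudes of the first and second derivatives of a scalar output, estimated by central finite differences. This only illustrates the idea behind SDP; it is not the paper's actual formulation, and `importance` is a hypothetical helper.

```python
def importance(f, theta, eps=1e-4):
    """Toy importance score |f'(theta)| + |f''(theta)| for a scalar
    output f, via central finite differences. At a flat optimum the
    gradient term is ~0, but high curvature still marks the parameter
    as important, which a first-derivative-only measure would miss."""
    f0, fp, fm = f(theta), f(theta + eps), f(theta - eps)
    first = (fp - fm) / (2 * eps)             # central first derivative
    second = (fp - 2 * f0 + fm) / eps ** 2    # central second derivative
    return abs(first) + abs(second)

f = lambda x: x ** 2
score = importance(f, 1.0)   # |f'(1)| + |f''(1)| = 2 + 2
```

At theta = 0, where f'(0) = 0, the score is still about 2 from the curvature term alone, which is the kind of parameter a purely gradient-based measure would deem unimportant.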
Although MPGAN overcomes catastrophic forgetting in the experiments above, it may not succeed on other datasets. This is because parameter regularization methods assume that the data distributions of images in different tasks can be united into one joint distribution, and they attempt to reduce the representational overlap among tasks through a weight-freezing or weight-consolidation strategy. However, this assumption raises three problems. First, when the data distributions of different tasks differ greatly, they may have little overlap, meaning that no single parameter configuration can fit the data distributions of both the old task and the new task. Second, when the model is updated with new data, it is restricted to finding a local optimum near the parameter region already optimized for old tasks, so other points that yield a better joint distribution for both the new and old tasks may be missed or ignored. Third, parameter regularization methods approximate the unified joint distribution by various means, such as the Fisher information matrix in EWC and information entropy in the proposed SDP, but these approximations are not very accurate; the joint distribution they obtain is not the true distribution and may contain egregious errors.
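The Fisher-information approximation mentioned above is typically estimated diagonally in EWC-style methods as the per-parameter mean of squared per-sample log-likelihood gradients. A minimal sketch under that assumption follows; the gradient array `g` is a toy stand-in for real per-sample gradients.

```python
import numpy as np

def fisher_diag(grads):
    """Diagonal empirical Fisher estimate used by EWC-style methods:
    the per-parameter mean of squared per-sample log-likelihood
    gradients. grads: array-like of shape (n_samples, n_params).
    Note this keeps only the diagonal, one source of the
    approximation error discussed above."""
    return np.mean(np.asarray(grads) ** 2, axis=0)

g = [[1.0, 0.0], [3.0, 2.0]]   # toy per-sample gradients
F = fisher_diag(g)             # -> [5.0, 2.0]
```

Dropping the off-diagonal terms ignores interactions between parameters, which is one concrete way the approximated joint distribution can deviate from the true one.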

VII. CONCLUSION
In this paper, a framework called MPGAN is introduced to encapsulate various parameter regularization methods and address GANs' catastrophic forgetting when learning multiple consecutive tasks. In addition, a new parameter regularization method, called SDP, is proposed; it estimates parameter importance more accurately by exploiting both the first derivative and the second derivative of the output function. Experiments demonstrate that MPGAN can successfully overcome forgetting with many different parameter regularization methods, and that the proposed SDP outperforms EWC and MAS in calculating parameter importance within MPGAN. Further improvements can still be made toward making MPGAN capable of more complex continual tasks, such as continual style transfer [34] and continual face recognition [35].