Abstract

The development of object detection networks has reached a high point, with significant improvements in accuracy and detection speed. Object detection is widely used in intelligent robots, self-driving cars, and other edge-intelligent terminals. Unfortunately, when a detector is allowed to learn new objects in an unfamiliar environment, it can catastrophically forget the objects it has already learned; in particular, existing approaches struggle to extract reliable and stable knowledge from old models. Motivated by this, we present a new multinetwork mean distillation loss function for open-world domain incremental object detection. To better extract reliable and stable knowledge from old models, we strengthen the distillation output of the ResNet50 backbone at the detector's input and of the RoI head at its output, while the distillation output of the intermediate RPN is softened by adaptive distillation. To obtain more stable results, the outputs of the ResNet50 backbone and RPN are zero-averaged along the channel dimension. Various incremental-step and stability experiments are performed on two benchmark datasets, PASCAL VOC and MS COCO. The experimental results show the excellent performance of our method in different experimental scenarios, where it is superior to the most advanced methods. For example, in the batch task setting, incremental object detection on the PASCAL VOC and MS COCO datasets is improved by 3.4% and 2.1%, respectively.

1. Introduction

Object detection models, currently the most representative models for vision tasks, play a significant role in fields such as intelligent robotics [1], autonomous driving [2], and other edge intelligent terminal schemes [3]. However, existing supervised models can only be trained on labeled task data from the categories present in the training dataset. Furthermore, adapting to a new task requires adjusting the network parameters of the model, and it is difficult for existing object detection models to adjust to dynamic real-world environments without forgetting old knowledge [4].

In this work, we investigate the problem of class-incremental multinetwork object detection in light of the catastrophic forgetting mechanism. In the incremental setting, tasks are introduced sequentially to the object detector, and a high-performing agent should maintain old-task performance while learning new tasks. Therefore, the adaptive parameter updates executed when a new task arrives are constrained by applying knowledge distillation [4] at the model parameter level [5–7]. For Faster-RCNN [8], which has a multinetwork structure, distilling only a single network [9] does little to alleviate the catastrophic forgetting problem, and multinetwork distillation [5, 10] is more effective at retaining old knowledge across the whole network. However, existing incremental object detection methods based on multinetwork knowledge distillation bias the model toward the old tasks and weaken its learning of new tasks.

Based on these considerations, Peng et al. [10] proposed an incremental learning approach with multinetwork adaptive distillation, where distillation is set up in multiple networks and the teacher network is used as a lower bound for adaptive knowledge extraction. However, forcing a comparison of the teacher and student network outputs as the adaptive extraction condition may cause significant loss of past knowledge: outputs identified as more important for new-task learning are zeroed out of the distillation loss, yet the weights related to those output values may be equally important for the old task. In contrast, Joseph et al. [5] performed meta-learning by designating an RoI head layer and setting a certain number of iterations to optimize the gradient update direction, thereby learning the new task better. However, because the region proposal network (RPN) is class-agnostic, their method includes no distillation loss term for it. The accuracy with which the RPN classifies objects versus background and regresses anchors for an old task directly affects the RoI head's predictions in the next stage and the degradation of the old classes, because the candidate regions the RoI head learns from are generated from the preceding step after pooling. This results in a lack of old-task candidate regions during RoI head learning, which harms the detector's ability to recognize both new and old classes. In addition, because the network parameters change during training to fit the new task, directly computing the distillation loss on the raw network output at each stage causes the new model's output to deviate strongly from the output data distribution of the old model, making the network output difficult to fit and unstable.

To address the aforementioned challenges, we propose a new distillation scheme for the Faster-RCNN incremental object detector. We strengthen the distillation output of the ResNet50 backbone at the input level and of the RoI head network at the output level, and we use adaptive distillation to maintain the past knowledge of the RPN. Moreover, we adopt the meta-learning strategy in [5] to mitigate the degradation of new-task learning performance caused by knowledge distillation. In addition, to address bias in the output data distributions of the new and old models, we zero-average the output data of the ResNet50 backbone and RPN of both models. Consequently, the primary contributions of our work are as follows:

(i) We propose a new scalable multinetwork distillation scheme for the Faster-RCNN detector that uses strengthened distillation values for the ResNet50 backbone and RoI head network and adaptive distillation for the RPN to mitigate the catastrophic forgetting problem.

(ii) To alleviate the instability of the network output caused by differences between the old and new network outputs, we zero-average the outputs of the backbone network and RPN of the old and new models and average the RoI head network output over the classes, yielding a new set of distillation losses.

(iii) We evaluate extensively on the PASCAL VOC and COCO benchmark datasets and compare against two advanced baseline methods. The experimental findings demonstrate the superior performance of our approach in various incremental scenarios.

2. Related Work

Incremental learning is a special machine learning paradigm that simulates the human brain's learning of sequential task streams: the model continuously learns new tasks while maintaining old-task performance. Maintaining these properties requires addressing the forgetting that new-task learning induces in the model [11]. On this basis, this section reviews incremental learning techniques based on knowledge distillation and distillation loss optimization.

2.1. Incremental Learning Approach for the Knowledge Distillation Strategy

Knowledge distillation [4] methods have been extended to mutual distillation learning [12–14], assisted distillation learning [15–17], spatial location distillation learning [18, 19], and dataset distillation learning [20–22]. In addition, knowledge distillation can be used in incremental learning because of its ability to transfer knowledge from one model to another [8–13]. Knowledge distillation in incremental learning typically transfers old information from the teacher network to the student network to alleviate the forgetting of old knowledge. As a traditional incremental learning method based on knowledge distillation, LwF [23] mitigates forgetting by freezing the old model as the teacher network, using a temperature factor to soften the softmax output of the logits, adding this term to the current task loss as a regularizer, and thereby constraining the model's parameter updates. However, LwF is vulnerable to a significant learning bias when the old and new classes are imbalanced. To address this problem, Zhao et al. [24] combined weight aligning (WA) with knowledge distillation, utilizing WA to balance the weights of the old- and new-class information in the final fully connected layer while using knowledge distillation to maintain the model's discrimination of the old classes. In contrast, Dong et al. [25] used a dual-teacher distillation framework to mitigate the class imbalance problem, using sampled unlabeled data to extract knowledge from the base-class and new-class teacher models and transfer it to the student model. Similarly, Abdelsalam et al. [26] used a dual-teacher distillation strategy with regular and superclass teachers to solve the incremental implicitly refined classification (IIRC) problem. In feature relationship exploration, Yang et al. [9] explored important correlations between old and new classes in the feature space, maintaining the ability to detect new classes by transferring these correlations and thus retaining the close relationships within important learned knowledge. Similarly, Dong et al. [27] proposed exemplar relationship distillation incremental learning (ERDIL), which mines exemplar relationship information in old tasks through exemplar relation graphs (ERGs) and uses graph-relation-based knowledge distillation to transfer old knowledge to CNN models learning new tasks. The knowledge distillation-based methods above all distill over the model structure: the distilled old model constrains the new model's updates during new-task learning and maintains performance on both old and new tasks. Additionally, the multinetwork structure of the object detection model inspires our model structure distillation.

2.2. Loss Optimization Method for Knowledge Distillation

In studies on knowledge distillation loss, existing distillation loss optimization methods [28–32] mainly remedy the deficiencies of incremental learning processes by combining distillation with other techniques. Li et al. [30] prevented the features extracted from intermediate neural network layers from changing drastically by adding feature distillation loss terms and minimizing the feature differences with a smoothed L1 loss function. EEIL [28] combines cross-entropy and distillation loss in an end-to-end learning network, using cross-entropy to learn new classes and distillation to retain knowledge of old classes. Xiang et al. [29] proposed a dynamic correction vector algorithm that combines representational memory and knowledge distillation loss, optimizing the cross-entropy and knowledge distillation loss functions to alleviate distillation bias and model overfitting. Douillard et al. [33] combined representation learning with distillation to mitigate the impact of feature extraction network changes, using a multiagent classifier and a spatially based distillation loss to constrain representation evolution. To address the old-new data imbalance problem, Wu et al. [31] proposed the BiC algorithm for large-scale data processing, which builds on distillation loss to correct old-new class bias. Similarly, Hou et al. [32] combined cross-entropy loss, feature-based distillation loss, and a margin ranking loss that separates old and new classes to mitigate the adverse effects of class imbalance. For the object detection problem, ILOD [9] uses knowledge distillation to regularize the output of the final classification and regression layers to retain old-task performance. Chen et al. [6] used cue learning to maintain the initial model's feature information, added it to the distillation loss calculation, and set a confidence loss to extract the confident information of the initial model to further mitigate forgetting. In the detector feature space, Yang et al. [34] investigated the applicability of both old and new classes and set distillation loss terms for two-stage Faster-RCNN from three perspectives: channel-based, point-based, and instance-based. Notably, introducing knowledge distillation reduces the model's focus on new tasks and degrades its performance on them. Based on this, Peng et al. [10] applied adaptive distillation to multiple networks in the Faster-RCNN detector, using the teacher network as a lower bound and adaptively extracting knowledge to improve new-task learning. In contrast, Joseph et al. [5] set a warp loss to optimize the gradient update direction by specifying a layer of the RoI head network for meta-learning, making it better adapted to learning new tasks. In conclusion, loss function design can both alleviate the degradation of new-task learning caused by introducing knowledge distillation and mitigate forgetting, prompting us to place greater emphasis on optimizing the distillation loss in our work.

2.3. Gradient Meta-Learning

In contrast to the aforementioned methodologies, contemporary scholars have focused on investigating the potential of meta-learning to improve models' computational efficiency. The initial work by Andrychowicz et al. [35] established the groundwork for gradient meta-learning, proposing the automatic learning of hyperparameters for model optimizers by specifying particular optimizers; however, this approach makes it difficult to select suitable optimization algorithms and parameter settings. Furthermore, model-agnostic meta-learning (MAML) [36] is a prominent gradient meta-learning method that enhances the initial model parameters for various tasks by performing gradient meta-updates on multiple tasks, an approach that has garnered significant attention in few-shot learning. Nevertheless, the effectiveness of MAML is contingent on the availability of high-quality task data and is sensitive to hyperparameters, which imposes certain restrictions. To address these challenges, Franceschi et al. [37] introduced differentiable convex optimization techniques into meta-learning, enhancing the stability of the meta-update strategy and resolving the sensitivity issues of previous methods. In addition, Snell et al. [38] proposed a gradient-based meta-learning method utilizing a prototype network, which proved effective in scenarios with limited training samples and overcame the challenges posed by sparse data through the concept of category prototyping. Similarly, Kedia and Chinthakindi [39] employed the Reptile algorithm in meta-learning, combined with an inductive bias on pretrained weights, to enhance the generalization performance of the model; notably, the meta-gradient of Reptile incorporates a component that maximizes the inner product between gradients of different batches from the same task, thereby facilitating greater adaptability to new tasks. Furthermore, Xu et al. [40] integrated reinforcement learning with gradient-based meta-learning to enhance the effectiveness of deep reinforcement learning in large-scale applications, treating the payoff function as a parametric function with adjustable meta-parameters and addressing the optimization of task-specific objectives. Consequently, the integration of meta-learning has expanded the scope of incremental learning applications. In particular, Joseph et al. [5] introduced a meta-learning approach for incremental object detection that learns from intermittent data inputs: during the meta-learning process, newly acquired data are fed into the Faster-RCNN network, and the RoI head modules with unfrozen weights are adjusted to better align with the new task, addressing the performance degradation the model suffers when distilling knowledge for new tasks.

Inspired by the above work, we note that a common strategy for incremental object detection methods to retain acquired knowledge is to mimic the salient activations of the original model by minimizing a first-order distillation loss. We therefore devise a new multinetwork structure distillation strategy for Faster-RCNN-based object detectors. We account for the degraded new-task performance caused by knowledge distillation and, in addition, investigate the bias in the model output data distributions that arises between the teacher and student models during learning.

3. Proposed Method

Incremental object detectors do not require all data classes to be available in advance; when new data arrive, their structure must prevent catastrophic forgetting. Incremental object detection (iOD) [5] is a representative approach with a multinetwork structure: a new model, the student network, is trained to learn a new task while the weights of the previous model, the teacher network, are kept fixed, and a distillation loss mitigates the student network's forgetting. However, iOD directly calculates the distillation loss on the raw network outputs, which can produce an imbalanced data distribution and lead to instability in the model's output. As shown in Figure 1, our method mitigates catastrophic forgetting via multinetwork knowledge distillation and experience replay. Specifically, we reinforce the focus on past tasks at the beginning and end of the model, and we consider the problem of poor RoI head training in the next phase caused by the RPN's lack of training on old tasks. We therefore employ adaptive distillation in the intermediate RPN stage to conditionally maintain the model's focus on the old tasks' proposal regions; moreover, during knowledge distillation, we zero-average the inputs to improve the stability of the model output. Furthermore, to prevent knowledge distillation from overprotecting past tasks and limiting the learning of new tasks, we use the gradient preprocessing meta-learning method in [5]. The model learning process comprises two distinct phases: an incremental learning phase and a fine-tuning phase. In the incremental learning phase, the model learns the image features of the task by optimizing the specified loss functions; the fine-tuning phase then further trains the model on a small amount of stored data, adjusting the model parameters to adapt effectively to both previous and new tasks. The incremental learning phase itself consists of two stages: in the first, the model learns the new task (the new task loss) while the distillation loss constrains changes to the model parameters related to previous tasks (the multinetwork mean distillation loss); in the second, a meta-learned gradient matrix adjusts the direction of the model's learning gradient (the warp loss). In the following exposition of the experimental findings, all reported results are after fine-tuning, except those explicitly attributed to the incremental learning phase.

3.1. Problem Formulation

For a continuous task stream T, task T_t is delivered to the object detector D at moment t, where T is composed of the incremental subtask set {T_i} (i = 1, 2, …, n) and task T_t consists of the image dataset with labels at moment t. The images contain several objects from different classes, but the labels are only valid for objects in task T_t. Moreover, we set the update rule of D to be determined by θ. The parameter θ, which is used to define D, is divided into the task parameter C and the warp parameter W; that is, θ = C ∪ W and C ∩ W = ∅. In terms of the learning process, the model learns the task parameter C in the first stage and the warp parameter W in the second stage.

The specific learning process can be described as follows: at time t, given task T_t, the object detector D must learn from the input image I, and D can be regarded as an aggregate function. For two-stage Faster-RCNN, D can be formulated as a set function consisting of a backbone network D_bb, a region proposal network D_rpn, and an RoI head D_roi; i.e., D = D_roi ∘ D_rpn ∘ D_bb. The input I is subjected to feature extraction by D_bb to generate the feature map F. D_rpn uses these features to generate N candidate regions that may contain objects, together with the corresponding scores, and D_roi computes, for each candidate region, the probability of belonging to each class in task T_t and performs regression on its border position. For incremental object detection, it is challenging to maintain old-task performance in a continuous task stream T without accessing all the data; compared with the ordinary incremental classification problem, the incremental object detector must consider the classification and border regression of multiple networks with respect to old-knowledge memory, such as the classification-regression problem of D_roi and the old-class feature extraction problems of D_bb and D_rpn in the Faster-RCNN model. In our method, we employ a knowledge distillation strategy, freezing the past network model as the teacher network to guide the current task model as the student. For the subsequent theoretical exposition, elements labeled "te" are related to the teacher network, such as the teacher object detector D^te, and elements labeled "st" are related to the student network, such as the student object detector D^st.
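The teacher-student setup can be summarized in a few lines. The following is a minimal PyTorch sketch, assuming the detector is an ordinary nn.Module; make_teacher is a hypothetical helper name, not part of the original method's code.

```python
import copy
import torch.nn as nn

def make_teacher(student: nn.Module) -> nn.Module:
    """Freeze a copy of the current detector as the teacher D^te before a
    new task arrives; the live model keeps training as the student D^st."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)   # teacher weights stay fixed
    teacher.eval()                # disable dropout/BN updates in the teacher
    return teacher
```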

3.2. New Task Loss

The learning of new tasks by the model can be viewed as applying a loss function to learn the model parameters. Specifically, the object detector minimizes the classification error and the bounding box positioning error. Let p = (p_0, …, p_k) denote the predicted probabilities over the classes (k real classes and 1 background class), and let l = (l_x, l_y, l_w, l_h) denote the predicted bounding box position after pooling features for each RoI. The true labels are u (true class) and v (true class bounding box position), and the RoI head loss is defined as follows:

L_roi(p, u, l, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(l, v), (1)

where L_cls(p, u) is the log loss of the predicted class versus the true class and L_loc is the smooth L1 loss; when u = 0, i.e., the background, the bounding box regression loss does not need to be calculated. Similarly, the RPN training loss yields a prediction score o, which indicates whether the selected region contains instances, and a corresponding bounding box prediction r. This loss is defined as follows:

L_rpn(o, r) = L_cls(o, o*) + λ o* L_reg(r, r*), (2)

where o* indicates whether the region contains real labels (o* = 1 if it does, and 0 otherwise) and r* is the real bounding box regression target. The weighting parameter λ is set to 1 in all subsequent experiments.
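As a concrete illustration of the classification-plus-regression structure of equation (1), the following is a minimal PyTorch sketch of the RoI head term. It assumes p holds class logits and a single box prediction per RoI (rather than class-specific boxes); new_task_loss is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

def new_task_loss(p: torch.Tensor, u: torch.Tensor,
                  l: torch.Tensor, v: torch.Tensor, lam: float = 1.0):
    """Equation (1): log loss over k+1 classes plus a smooth L1 box loss
    that is skipped for background RoIs (u == 0); lam is set to 1."""
    # p: (R, k+1) class logits; u: (R,) true class indices
    # l: (R, 4) predicted box offsets; v: (R, 4) ground-truth offsets
    cls_loss = F.cross_entropy(p, u)             # L_cls(p, u), log loss
    fg = u > 0                                   # [u >= 1] indicator
    loc_loss = F.smooth_l1_loss(l[fg], v[fg]) if fg.any() else p.new_zeros(())
    return cls_loss + lam * loc_loss
```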

3.3. Multinetwork Mean Distillation Loss

Similar to the way new tasks are learned, our model retains the performance of previous tasks while new tasks are learned, by calculating the mean distillation loss of multiple networks. Like Faster ILOD [10] and iOD [5], we use knowledge distillation to maintain the model's performance on past tasks in a continuous task stream, softening the softmax output by inserting a temperature factor T into the logit output:

q_i = exp(z_i / T) / Σ_j exp(z_j / T), (3)

where z_i is the i-th logit and q_i is the corresponding softened probability. However, unlike Faster ILOD, which uses multinetwork adaptive distillation, and iOD, which distills only the feature map and RoI head, we strengthen the distillation output at both ends, D_bb and D_roi, to ensure the accuracy of backbone feature extraction at the very first input and of RoI head detection at the final output. On the other hand, to ensure that the input anchors contain past memory, we use adaptive zero-mean distillation in the intermediate RPN layer, preserving the RPN's memory of past knowledge by adaptively adding a distillation loss while alleviating the overprotection of past knowledge.

In addition, we consider that directly introducing knowledge distillation causes the new model to update its network parameters adaptively during training to fit the new task, so its output deviates strongly from the output data distribution of the old model, making the network output difficult to fit and unstable. Therefore, before calculating the distillation loss, we first apply zero-mean filtering to the outputs of D_bb and D_rpn. Specifically, the zero mean is obtained by subtracting the mean of all pixels from each pixel x_i:

x̂_i = x_i − (1/N) Σ_{j=1}^{N} x_j, (4)

where N is the number of pixels. After zero averaging, all pixel points are distributed around the origin as the center point, preventing the data distribution from being all negative or all positive at any given time while preserving the shape of the original data distribution; the mean of all pixels after zero averaging is zero. This facilitates the convergence of the model weights during back-propagation and makes the model output more stable.
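A minimal sketch of the zero averaging of equation (4) follows, assuming (as the abstract's channel-wise description suggests) that the mean is taken per channel over the spatial positions of an (N, C, H, W) feature map.

```python
import torch

def zero_mean(x: torch.Tensor) -> torch.Tensor:
    """Equation (4): subtract the mean of all pixels from each pixel so the
    output is centered at the origin while keeping the distribution's shape."""
    # Assumption: the mean is taken per channel over the spatial positions,
    # matching the channel-wise zero averaging described for D_bb and D_rpn.
    return x - x.mean(dim=(-2, -1), keepdim=True)
```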

Therefore, to retain the model performance on the previous task during the learning of the new task, we perform multinetwork distillation on the backbone network, RPN, and RoI head network and add a mean value strategy to remember past information.

3.3.1. Backbone Distillation Losses

F is the feature map containing the object pixel features extracted from the image, and F contains each feature pixel f_i. To obtain the object features associated with the old and new classes, a distillation loss constraint must be applied to F. We learn by freezing the weight parameters of D_bb^te and using the teacher network to teach the student network. For the same input I, the teacher network and the student network obtain outputs F^te and F^st, respectively. Furthermore, F serves as input to the subsequent classification and regression steps, and an accurate description of the old- and new-class features is particularly important for those steps, so we strengthen the distillation of F. In addition, for faster convergence, we obtain F̂^te and F̂^st by zero-averaging the features via equation (4). The backbone distillation loss is defined as follows:

L_bb = (1/N) Σ_i ||f̂_i^te − f̂_i^st||², (5)

where f̂_i^te and f̂_i^st are the pixel features in F̂^te and F̂^st, respectively, and N is the total number of feature pixels.
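The following PyTorch sketch realizes equation (5) under the assumption stated there, i.e., a mean squared difference between the zero-averaged teacher and student feature maps; the exact norm used in the original is not recoverable from this extraction.

```python
import torch

def backbone_distill_loss(f_te: torch.Tensor, f_st: torch.Tensor) -> torch.Tensor:
    """Equation (5) as reconstructed above: mean squared difference between
    the zero-averaged teacher and student backbone feature maps."""
    f_te = f_te.detach()                                      # frozen teacher
    f_te_hat = f_te - f_te.mean(dim=(-2, -1), keepdim=True)   # eq. (4)
    f_st_hat = f_st - f_st.mean(dim=(-2, -1), keepdim=True)
    return (f_te_hat - f_st_hat).pow(2).mean()
```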

3.3.2. RPN Distillation Loss

D_rpn proposes regions r = (r_1, r_2, …, r_j) for the old- and new-class features that D_bb extracted from I in the previous stage and determines whether each corresponding region has a class score o = (o_1, o_2, …, o_i). As the first stage of the two-stage detector, whether the proposal regions extracted by D_rpn contain old and new classes is particularly important for the next stage, in which D_roi classifies and regresses the old and new classes; however, an excessive distillation constraint on D_rpn increases its focus on the old classes and hampers the learning of the new task. We therefore adopt the idea of Peng et al. [10], who use the teacher network as a lower bound to adaptively choose whether to apply distillation constraints to D_rpn. In contrast to their approach, we subject the output scores of D_rpn^te and D_rpn^st to the zero averaging of equation (4) and the distillation softening of equation (3) and use the KL divergence as the classification distillation loss. For the anchor regression of D_rpn, we examine the value in each dimension of l and regulate the regression by setting a threshold τ, taking the empirical value τ = 1. We use the sum of the o values of D_rpn^te over the anchors as the activation value in the distillation loss calculation; anchors for which the student's activation is higher may be more important for new-task learning and are therefore excluded from the distillation loss. The RPN distillation loss is defined as follows:

L_rpn_dist = (1/N_a) Σ KL(ô^te_T || ô^st_T), (6)

where ô^te_T and ô^st_T are the scores output by D_rpn after the zero averaging of equation (4) and the distillation softening of equation (3), applied in succession; the empirical value is T = 6 in our experiments, and N_a is the total number of anchors.
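A sketch of the adaptive RPN distillation follows. The lower-bound gate, the threshold-clamped regression term, and the reduction are assumptions where the text is ambiguous; only the zero averaging, temperature T = 6, threshold τ = 1, and KL classification term are stated above.

```python
import torch
import torch.nn.functional as F

def rpn_distill_loss(o_te, o_st, r_te, r_st, T: float = 6.0, tau: float = 1.0):
    """Equation (6) sketch: teacher-as-lower-bound gating plus KL divergence
    on zero-averaged, temperature-softened objectness scores."""
    # o_*: (N_a, 2) objectness logits per anchor; r_*: (N_a, 4) regressions.
    o_te, r_te = o_te.detach(), r_te.detach()          # frozen teacher
    # Lower-bound gate: keep anchors whose teacher activation is at least the
    # student's; higher student activations may matter more for the new task
    # and are excluded from distillation.
    keep = o_te.sum(dim=1) >= o_st.detach().sum(dim=1)
    if not keep.any():
        return o_st.new_zeros(())
    zm = lambda x: x - x.mean(dim=1, keepdim=True)     # eq. (4) on the scores
    p_te = F.softmax(zm(o_te[keep]) / T, dim=1)        # eq. (3) softening
    log_p_st = F.log_softmax(zm(o_st[keep]) / T, dim=1)
    cls_term = F.kl_div(log_p_st, p_te, reduction="batchmean")
    # One possible reading of "regulate the regression with threshold tau":
    # clamp per-dimension gaps so large new-task shifts are not pulled back.
    reg_term = (r_te[keep] - r_st[keep]).abs().clamp(max=tau).mean()
    return cls_term + reg_term
```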

3.3.3. RoI Head Distillation Loss

D_roi takes the proposals that D_rpn generates for the old and new classes through pooling and obtains the final classification probabilities (p^te, p^st) and border regression values (l^te, l^st) for the teacher and student networks. In our approach, we focus more on the classification and regression of the old classes, since D_roi is the final stage of the two-stage object detector. In addition to the normal distillation loss calculation, we calculate the mean of each channel with respect to the class to make the loss focus on the overall trend of the final classification probabilities and border regression values:

x̄_i = (1/C) Σ_{j=1}^{C} x_{ji}, (7)

where C denotes the number of channels and x_{ji} denotes the i-th parameter of the j-th channel. The temperature factor T of equation (3) is also introduced into the log output to soften the softmax output and obtain more information about past knowledge. The RoI head distillation loss is therefore defined as follows:

L_roi_dist = L_cls(p̄^te_T, p̄^st_T) + L_reg(l̄^te, l̄^st), (8)

where p̄^te, p̄^st, l̄^te, and l̄^st are all variables processed by the mean of equation (7), and the variables labeled with T are processed through equation (3).
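The sketch below illustrates one way to combine equations (7) and (8): class-wise means over the channel (RoI) dimension, with the classification means softened by T and compared via KL divergence, and the box means compared via smooth L1. The exact combination in the original is not recoverable from this text, so the structure is an assumption.

```python
import torch
import torch.nn.functional as F

def roi_head_distill_loss(p_te, p_st, l_te, l_st, T: float = 6.0):
    """Equations (7)-(8) sketch: match class-wise means of teacher and
    student RoI head outputs; classification means are softened by T."""
    p_te, l_te = p_te.detach(), l_te.detach()         # frozen teacher
    # Eq. (7): mean over the channel (RoI) dimension per class / box dim.
    p_te_bar, p_st_bar = p_te.mean(dim=0), p_st.mean(dim=0)
    l_te_bar, l_st_bar = l_te.mean(dim=0), l_st.mean(dim=0)
    # Eq. (3): temperature-softened classification means, compared with KL.
    cls_term = F.kl_div(F.log_softmax(p_st_bar / T, dim=0),
                        F.softmax(p_te_bar / T, dim=0), reduction="sum")
    reg_term = F.smooth_l1_loss(l_st_bar, l_te_bar)   # smooth L1 on box means
    return cls_term + reg_term
```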

3.4. Total Task Loss in the First Phase

The first stage of our model's learning process for the task parameters can be characterized as learning the current task through each stage while the distillation loss at each stage corrects the model parameters to maintain past-task performance. The overall first-stage loss is therefore a linear combination of the new task loss and the multinetwork mean distillation loss. To balance the model's performance on past and present tasks, we employ a convex combination similar to that in [5] with a stability-plasticity trade-off parameter α:

L_total = (1 − α) L_new + α (L_bb + L_rpn_dist + L_roi_dist), (9)

where L_new is the new task loss of Section 3.2, α is set as 0.1 in our experiments, and the choice of α is analyzed in Section 4.3.
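In code, equation (9) is a one-liner; note that placing α on the distillation term (rather than the new-task term) is inferred from Table 2, where accuracy falls as α grows.

```python
def first_stage_loss(loss_new, loss_bb, loss_rpn_dist, loss_roi_dist,
                     alpha: float = 0.1):
    """Equation (9): convex combination of the new-task loss and the summed
    multinetwork mean distillation losses; alpha = 0.1 in the experiments."""
    loss_dist = loss_bb + loss_rpn_dist + loss_roi_dist
    return (1.0 - alpha) * loss_new + alpha * loss_dist
```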

3.5. Gradient Matrix Warp Loss

For the second phase of learning, the warp parameter W, as depicted in Figure 2, is configured in the warp layers of D_roi to learn a gradient preconditioning matrix.

In the distillation learning process, a small number of images are stored for each class by setting up an image store I_store, and the images in I_store are placed in the feature store F_store after feature extraction. Notably, F_store defines a fixed-size queue for each class to mitigate the class imbalance problem. The stored queue features are incorporated directly into the task learner by utilizing the meta-learned parameterization of W, which warps the gradient toward the steepest direction and enables the parameters to be updated in the direction most suitable for different learning tasks.

Each image in I_store is passed through D_bb and D_rpn to generate the RoI-pooled features and associated labels, which are then queued into F_store. Let f be the RoI-pooled features, where f generates the predicted classification value p and the border prediction l through D_roi. The warp loss can then be calculated from the features and labels stored in F_store as follows:

L_warp = L_cls(p, u) + L_loc(l, v), (10)

where L_cls is the log classification loss and L_loc is the smooth L1 loss.
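A compact sketch of the feature store and warp loss follows, assuming roi_head maps stored RoI features to (class logits, box predictions); FeatureStore and warp_loss are hypothetical names, and the queue size of 10 is taken from Section 4.4.

```python
import torch
import torch.nn.functional as F
from collections import deque

class FeatureStore:
    """Fixed-size per-class queues of RoI-pooled features and targets,
    mirroring F_store."""
    def __init__(self, num_classes: int, maxlen: int = 10):
        self.queues = [deque(maxlen=maxlen) for _ in range(num_classes)]

    def push(self, feat, label: int, box_target):
        self.queues[label].append((feat.detach(), label, box_target.detach()))

    def items(self):
        return [x for q in self.queues for x in q]

def warp_loss(roi_head, store: FeatureStore):
    """Equation (10): classification log loss plus smooth L1 box loss over
    the stored features, used to meta-update only the warp layers."""
    items = store.items()
    f = torch.stack([feat for feat, _, _ in items])
    u = torch.tensor([label for _, label, _ in items])
    v = torch.stack([tgt for _, _, tgt in items])
    p, l = roi_head(f)            # predicted class logits and box values
    return F.cross_entropy(p, u) + F.smooth_l1_loss(l, v)
```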

4. Experimental Analysis

4.1. Datasets and Evaluation Metrics

We evaluated our method on the PASCAL VOC 2007 [41] and MS COCO 2014 [42] datasets. PASCAL VOC 2007 contains 9,963 images with 24,640 annotated instances across 20 classes. Following the setup in [41], 50% of the dataset is used for training and validation, and the rest for testing. MS COCO 2014 contains objects from 80 classes, with 83,000 training images and 41,000 validation images.

For the evaluation metrics, the mean average precision at an IoU threshold of 0.5 (mAP@50) was used as the main metric for both datasets. For MS COCO, we additionally report multiple IoU-based metrics (AP, AP50, and AP75) and size-based metrics (APs: small, APm: medium, and APl: large).

4.2. Incremental Experimental Scenario Setting

Similar to [5], we simulate incremental scenarios for PASCAL VOC and MS COCO, where the dataset provides a set of selected classes C for task T_t, which is passed to the learner at moment t. Each image in the dataset may contain multiple classes; one or several classes belonging to C are learned as the task's object classes, and instances of classes not belonging to C are left unmarked and are not learned.

According to the different difficulty levels of the classification task, we considered the effect of the learning intensity of the initial base-class task and the incremental tasks on the model output and defined class flow tasks and batch tasks. In class flow tasks, after the base-class task T0 is learned, incremental tasks Ti flow into the model with 1 to 2 additional classes per task, whereas batch tasks contain only a single incremental task. As indicated in Table 1, we devised seven incremental scenarios according to the difficulty of the incremental experiment, divided into class flow tasks and batch tasks. The dataset used in experiments (a) to (f) comprises the 20 classes of PASCAL VOC, and the dataset used in experiment (g) comprises the 80 classes of the MS COCO dataset.

4.3. Discussion and Analysis
4.3.1. Stability Analysis of the Zero Mean

To validate the influence of the zero mean on the stability of the model output, we conducted five consecutive replications of experiments (d) through (f) and created box plots from the outputs, as depicted in Figure 3. As seen from the figure, in experiment (d), our method has an obvious advantage: its lowest mAP@50 value, 64.7%, exceeds the highest value of 63.63% obtained by the iOD method. In experiment (e), our approach's stability is not merely comparable to but substantially superior to that of the iOD method, which exhibits an outlier (62.68%). In experiment (f), the stability gap between our method and iOD widens further, and the stability of our model remains better than that of iOD, although the maximum value of 68.28% obtained by iOD is higher in accuracy than the lowest value of 68.18% obtained by our method. This indicates that in an incremental task containing only one class, the gap between different methods fluctuates slightly, but the results in Figure 3(c) show that our method still outperforms iOD in overall accuracy. The stability experiments reported in Figure 3 show that our method outperforms the iOD method, which lacks a zero mean, in output stability, and the overall accuracy across the five runs in all three scenarios is higher than that of iOD, fully demonstrating the reliability and stability of our method.

To find the optimal stability trade-off parameter α of equation (9), we conducted several experiments on the value of α using the incremental task setup of experiment (d). Table 2 displays the individual outcomes. The table shows that as α gradually increases, our model's experimental accuracy gradually declines from 65.0% to 60.7%. Based on these results, we set α to 0.1 in all tests.

4.3.2. Ablation Experiment

To confirm the accuracy gains of the techniques introduced by our method (RPN adaptive distillation and the zero mean), we carried out ablation experiments. As shown in Table 3, we again used the task form of incremental experiment (d). The results are the mAP values of the base-class task T0 (first 10 classes), the incremental task T1 (last 10 classes), and all 20 classes. As seen from the table, when only zero averaging is introduced, the learning ability on the new task is improved, and the mAP of T1 reaches 67.33%. When the RPN adaptive distillation loss is introduced, the retention of past knowledge is improved, and the mAP of T0 reaches 62.20%. When both strategies are implemented, the highest mAP values are attained for the new task T1 and all 20 classes (68.28% and 65.00%, respectively), and the experimental results are consistent with the conclusions of our theoretical analysis in Section 3.3. Regarding the remaining two approaches [9, 10], it is worth noting that Faster ILOD [10] and the method by Shmelkov et al. [9] exhibit a comparative advantage in preserving performance on previous tasks. However, in practical production scenarios, emphasis is placed on the significance of new tasks. Consequently, our proposed method not only enhances the model's performance on new tasks but also maintains a certain level of performance on previous tasks, resulting in a more substantial enhancement of overall task performance than the aforementioned approaches [9, 10], with superior performance on all tasks. Furthermore, the results of Experiment 3 and Experiment 4 in Table 3 demonstrate that adding the zero mean in Experiment 4 leads to a decrease of approximately 0.5% in T0 accuracy relative to Experiment 3, while the T1 accuracy improves by approximately 2%. The retention of T0 performance can be attributed to the RPN adaptive distillation loss, which enhances the model's focus on the old task (T0: 62.20% in Experiment 3), whereas introducing zero-mean learning makes the model focus on the new input data, allowing the model weights to adapt better to the new task during the distillation computation and yielding the roughly 2% enhancement on the new task. Importantly, the performance on the old task is still largely maintained, with only the 0.5% decrease. Compared with Faster ILOD [10] and the model by Shmelkov et al. [9], our method achieves superior T1 performance, surpassing them by 13.81% and 5.14%, respectively, and outperforms both in overall performance, by 2.88% and 1.85%, respectively.

To provide additional evidence of the effectiveness of our approach, we analyzed the incremental learning phase of the ablation experiments in more detail (refer to Table 4). The results indicate that the average precision (AP) values the model obtains for identifying the old classes during the incremental class learning phase exceed 9.09. This can be attributed to the model generating highly similar probability distributions for these classes while still exhibiting subtle differences that enable partial identification of the old classes. This phenomenon may arise from the inherent limitation that, during knowledge distillation, the student model cannot exactly replicate the probability distribution of the teacher model and can only approximate its output distribution. Hence, as depicted in Table 4, incorporating RPN adaptive distillation (Experiment 3) during the incremental learning phase enables the model to effectively recognize the old classes, identifying 9 of them, and the overall accuracy of recognizing the old classes is enhanced by 2.83% compared with the mAP value obtained in Experiment 1. Incorporating the zero mean (Experiment 4) enhances the model's performance on the new task, reaching 72.27%, while the performance on the old task decreases by only 0.11%. This outcome indicates that our approach achieves a better trade-off between plasticity and stability, effectively maintaining performance on the old task while facilitating adaptation to the new task.

4.4. Analysis of the Experimental Results of Incremental Object Detection

We employ stochastic gradient descent (SGD) with a momentum of 0.9. The initial learning rate is set to 0.02 and is decayed to 0.0002. Each task receives 18,000 iterations of base-class training on the PASCAL VOC dataset, followed by 100 iterations per image and a total of 90,000 iterations for each of the two tasks. Training is executed on a single NVIDIA GeForce RTX 2080 Ti GPU; since the GPU processes two images simultaneously, the batch size is two. The queue sizes of the feature store F_store and image store I_store are set to 10. The evaluation considers 100 detections per image with an NMS threshold of 0.4. The stability coefficient α is 0.1.
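For reference, the reported optimization setup can be written as the following sketch; the learning-rate decay milestones are assumptions, since the text gives only the initial (0.02) and final (0.0002) rates.

```python
import torch
import torch.nn as nn

def build_optimizer(detector: nn.Module):
    """Reported setup: SGD with momentum 0.9, initial lr 0.02 decayed to
    0.0002; the decay milestones here are illustrative assumptions."""
    optimizer = torch.optim.SGD(detector.parameters(), lr=0.02, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[60000, 80000], gamma=0.1)
    return optimizer, scheduler

ALPHA = 0.1          # stability coefficient of equation (9)
BATCH_SIZE = 2       # two images at a time on a single RTX 2080 Ti
STORE_SIZE = 10      # queue sizes of the image store and feature store
NMS_THRESHOLD = 0.4  # evaluation uses 100 detections per image
```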

4.4.1. Class Flow Tasks

Incremental simulations are performed in which the model learns the first 10 or first 15 classes of the PASCAL VOC dataset as base classes, and the detector is then fed one or two classes at a time. Tables 5–7 show the experimental results for the class flow tasks. The first row displays joint learning of all 20 classes as the incremental learning upper bound; the second row displays the model learning the base classes, where the base classes are the first 10 classes in experiments (a) and (b) and the first 15 classes in experiment (c); and the following rows display each class in turn according to its ordinal number in the class flow task. The tables show the change in mAP over all classes as well as the per-class AP values at each incremental step. Figure 4 depicts, for each incremental task in the class flow setting, the trend of the model's performance on the base class, the old classes, and all classes.

As seen from the data in Table 5, in the incremental scenario of experiment (a), we set the number of base classes to 10, and our mAP value is higher than that of the iOD method throughout the class-incremental process. The overall task mAP gradually decreases as each single-class task is input at each increment step. The largest mAP difference reaches 3.6%, and the average mAP advantage over iOD is 2.36% as the class increments accumulate. The detailed progression of experiment (a) is depicted in Figure 4(a): the mAP gap between our model and iOD on the base class, the old classes, and all classes steadily widens as new classes are added, demonstrating that our model is more stable under an incremental task flow. In the incremental scenario of experiment (b), we increased the complexity of the incremental task by adding two classes per step. Table 6 shows that as the difficulty of the incremental task increases, our model has a clearer advantage than in experiment (a) in terms of overall accuracy, with an average mAP difference of 4.5%. In the details of experiment (b) depicted in Figure 4(b), the mAP gaps between our model and iOD for the base class, old classes, and all classes are more pronounced; specifically, when learning the final task, the all-class gap reaches 5.1%. Compared with experiment (a), we observe that as the incremental task learning difficulty grows, the model's overall-task mAP declines at a faster rate. Experiment (c) increases the difficulty of learning the base classes by raising the number of base classes to 15 with one incremental class per step. When the number of base classes increases to 15, the mAP gap between our model and iOD gradually widens as new classes are learned, reaching a maximum of 3.9% in Table 7. In the details of experiment (c) shown in Figure 4(c), the mAP of our model is better than that of iOD for the base class, old classes, and all classes. Across the class flow experiments, we observe that, relative to the one-class-per-step setting of experiment (a), increasing the learning difficulty as in experiment (b) significantly decreases the learning effect for each incremental task, as depicted in Figure 4(b); however, the final all-class mAP is comparable to that of experiment (a), remaining at 46.9%. Compared with the 10 base classes of experiment (a), increasing the base-class difficulty as in experiment (c) decreases the learning effect on the very first task while keeping the total number of learned classes constant; however, with a shorter incremental task stream, the model's final learning effect is better than that of a class stream with many tasks, as shown in Figure 4(c), remaining at 54.8%.

4.4.2. Batch Tasks

In the batch task settings, we considered class batch learning of the model on the PASCAL VOC and MS COCO datasets with different numbers of base classes and incremental classes to validate the accuracy of our method.

Table 8 shows our results on the COCO dataset for experiment (g). Specifically, we set up an incremental scenario with 40 base classes and 40 incremental classes and used the standard COCO evaluation protocol with multiple IoU-based metrics (AP, AP50, and AP75) and size-based metrics (APs: small, APm: medium, and APl: large) for a comprehensive evaluation. As seen in Table 8, our model continues to show excellent performance even in the high-volume class learning scenario of the complex COCO dataset. It outperforms the iOD technique by more than 2% across all scales of evaluation, and its AP50 is 4.7% greater than that of the iOD method.

On the PASCAL VOC dataset, we report results compared with the model by Shmelkov et al. [9], Faster ILOD [10], and iOD [5] in terms of mAP, while on the MS COCO dataset, we compare our results with those of iOD [5] under the standard COCO evaluation protocol. Tables 9–11 show the results of these comparisons.

In experiment (d), we set up a batch task increment scenario with T0 : T1 = 10 : 10. Table 9 reveals that our model achieved the best learning effect among all compared methods: the overall mAP reached 65.0%, and the best learning effect was on the new task, where the mAP reached 68.3%. Our method is also superior to iOD in maintaining old-task performance, with the old-class mAP reaching 61.0%.

In experiment (e), we adjusted the class learning ratio of T0 and T1 (T0 : T1 = 15 : 5). In the results shown in Table 10, our method is slightly inferior to the methods in [9, 10] in overall task mAP but superior to the most recent method, iOD, in both retaining old-task performance and learning the new task, obtaining 64.5% overall task mAP, 66.9% old-task mAP, and 64.5% new-task mAP.

In experiment (f), we increased the number of classes learned in T0 to 19. In the results reported in Table 11, our model is comparable to the other methods in overall task learning and old-class task performance, and it attains the best mAP, obtaining 68.9% mAP for the overall task and 68.9% mAP for the old-class task.

The class flow and batch task experiments demonstrate that the fewer classes T1 learns in the incremental task phase, the smaller the fluctuation of the model's total mAP during each new-task learning process, with the smallest fluctuation occurring when only one class is added. Conversely, when the number of classes learned by T1 increases, the gap in overall mAP between the various methods increases significantly, and our method significantly outperforms the other methods in all the incremental scenarios of the experiment.

4.4.3. Time Performance Analysis

Table 12 compares the training time of our method and the iOD method during the incremental learning phase. The results indicate that our method requires slightly more training time, primarily due to the additional parameters computed in the incremental learning phase; however, the difference between the two methods is relatively small. Notably, our method achieves a 4.4% higher mean average precision (mAP) across the old and new classes than the iOD method. In future studies, we will investigate how to decrease the training time while maintaining optimal model performance.

5. Conclusion

Existing object detection models mainly address the catastrophic forgetting problem through knowledge distillation; however, detection models with multiple network architectures require distinct types of distillation procedures. In this work, we present a novel multinetwork mean distillation method for object detection that zero-averages the model output parameters before adding them to the distillation loss, further improving the stability of the model output, while strengthening the distillation loss at the input and output sides of the network structure and adaptively distilling the intermediate network. We combine meta-learning with the multinetwork mean distillation method. We set up numerous incremental tests on the two benchmark datasets, and the outcomes show that our model performs better than the comparison models.

Data Availability

All data are generated by relevant algorithms. If you need to reproduce the experimental results, please contact the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Jing Yang was responsible for conceptualization, methodology, investigation, formal analysis, writing the original draft, and revising the draft. Kun Yuan was responsible for data curation, methodology, and writing the original draft. Suhao Chen was responsible for resources, methodology, and supervision. Qinglang Li performed revision and checked the draft. Shaobo Li was responsible for visualization and investigation. Xiuhua Zhang was responsible for resources, supervision, formal analysis, revision, and checking the draft. Bin Li was responsible for methodology and validation.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 62166005), the Guizhou Provincial Key Technology R&D Program (Grant nos. QKH[2023]368, QKH[2022]003, and QKH[2021]335), Developing Objects and Projects of Scientific and Technological Talents in Guiyang City (Grant no. ZKHT[2023]48-8), Joint Open Fund Project of Key Laboratories of the Ministry of Education (Grant nos. QJH[2020]245 and QJH[2020]248), and the Guizhou Provincial Science and Technology Projects (Grant no. PTRC[2020]6007-2).