A Survey on Few-Shot Class-Incremental Learning

Large deep learning models are impressive, but they struggle when real-time data is not available. Few-shot class-incremental learning (FSCIL) poses a significant challenge for deep neural networks: learning new tasks from just a few labeled samples without forgetting the previously learned ones. This setup easily leads to catastrophic forgetting and overfitting, severely affecting model performance. Studying FSCIL helps overcome the limitations of deep learning models with respect to data volume and acquisition time, while improving the practicality and adaptability of machine learning models. This paper provides a comprehensive survey on FSCIL. Unlike previous surveys, we aim to synthesize few-shot learning and incremental learning, introducing FSCIL from two perspectives while reviewing over 30 theoretical research studies and more than 20 applied research studies. From the theoretical perspective, we provide a novel categorization approach that divides the field into five subcategories: traditional machine learning methods, meta-learning-based methods, feature and feature space-based methods, replay-based methods, and dynamic network structure-based methods. We also evaluate the performance of recent theoretical research on benchmark datasets of FSCIL. From the application perspective, FSCIL has achieved impressive results in various fields of computer vision, such as image classification, object detection, and image segmentation, as well as in natural language processing and graph analysis. We summarize the important applications. Finally, we point out potential future research directions, including applications, problem setups, and theory development. Overall, this paper offers a comprehensive analysis of the latest advances in FSCIL from the methodological, performance, and application perspectives.


Introduction
In recent years, significant advancements in computing technology and the widespread availability of large-scale datasets have enabled deep neural networks (DNNs) to make remarkable progress in various computer vision tasks (He et al., 2016; Krizhevsky et al., 2017). However, many of these successes rely on idealized assumptions and massive amounts of available training data, which may not accurately reflect real-world scenarios where high-quality data is often scarce. For instance, in scenarios where data arrives incrementally in batches and newly added categories contain very few samples, many existing methods prove to be ineffective.
The goal of few-shot class-incremental learning (FSCIL) is to endow AI with the capability to address the aforementioned challenges. This requires DNN models to learn new tasks incrementally from a small number of labeled samples, without forgetting the previously learned ones (Tao et al., 2020). Since Tao et al. (2020) first proposed the concept of FSCIL, many scholars have extended it to various application scenarios beyond visual tasks because it conforms to human learning patterns and is suitable for real-world applications.
An intuitive method for FSCIL is to fine-tune a base model on a new training set. However, this leads to catastrophic forgetting (McCloskey and Cohen, 1989) and overfitting, corresponding to two core challenges: the stability-plasticity dilemma and unreliable empirical risk minimization.

Stability-plasticity dilemma
The stability-plasticity dilemma reflects the contradiction between stability and plasticity. Stability means that a neural network should maintain its learned knowledge and resist changes caused by new inputs. Conversely, plasticity means that the network should have the ability to adapt to new inputs or tasks. Catastrophic forgetting can be seen as a manifestation of the stability-plasticity dilemma. In incremental learning (IL), an overly stable model might fail to learn new tasks or data effectively. In contrast, an exceedingly plastic model might rapidly lose information about previously learned tasks or data. See Fig. 1 (a) for more details.

Figure 1: (a) Stability and plasticity cannot be achieved simultaneously. When a model has high stability, it performs well on old data but struggles with new data. As plasticity increases, the model demonstrates enhanced generalization on new data while gradually forgetting old data; (b) Given a hypothesis space H and initial parameters h_θ, ĥ is the function that minimizes the expected risk over all functions, and h* is the function in H that minimizes the expected risk. h_f and h_s correspond to the functions minimizing the empirical risk when data samples are few and sufficient, respectively. When the data is sufficient, ERM yields results closer to h*.
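The trade-off can be illustrated with a toy one-parameter model. The sketch below is only an illustration under assumed quadratic losses; the names `loss_old`, `loss_new`, and `fine_tune` are hypothetical and do not come from any FSCIL method. The model starts at the optimum of an old task and is fine-tuned on a new task alone.

```python
def loss_old(w):
    # Assumed toy loss for the old task, minimized at w = 0.
    return (w - 0.0) ** 2


def loss_new(w):
    # Assumed toy loss for the new task, minimized at w = 2.
    return (w - 2.0) ** 2


def fine_tune(w, steps, lr=0.1):
    # Plain gradient descent on the new-task loss only.
    for _ in range(steps):
        grad = 2.0 * (w - 2.0)
        w -= lr * grad
    return w


# More fine-tuning steps mean more plasticity and less stability.
for steps in (0, 2, 10, 50):
    w = fine_tune(0.0, steps)
    print(steps, round(loss_old(w), 3), round(loss_new(w), 3))
```

As the number of fine-tuning steps grows, the new-task loss falls while the old-task loss rises, which is the behavior sketched in Fig. 1 (a).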

Unreliable empirical risk minimization
In traditional machine learning frameworks, empirical risk minimization (ERM) aims to optimize the average loss on training data. This strategy works well in large-scale data environments where there are enough samples to ensure statistical consistency during training. However, in the context of few-shot learning (FSL), this strategy faces a challenge known as the unreliable empirical risk minimizer problem (Wang et al., 2020b). The core of this problem is that when the number of training samples is limited or when the samples are noisy, the ERM strategy may lead to overfitting. Overfitting means that the model performs well on the training data but generalizes poorly to new, unseen data. This shortfall arises because limited data may not fully represent the true distribution of the data generation process, causing the model to capture random noise in the data rather than the underlying true patterns. Fig. 1 (b) shows that when training samples are insufficient, the function obtained by ERM cannot accurately approximate the function that minimizes the expected risk.
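A small numerical sketch can make this concrete. It assumes the simplest possible setting, which is not taken from the cited works: the hypothesis space contains only constant predictors, the loss is the squared error, and the data are Gaussian. In that case the empirical risk minimizer is the sample mean and its excess expected risk has a closed form. All function names are hypothetical.

```python
import random


def erm_constant(samples):
    # Over constant predictors with squared loss, the empirical risk
    # minimizer is the sample mean.
    return sum(samples) / len(samples)


def excess_risk(c, true_mean):
    # Expected squared loss of predicting c minus that of the best
    # constant (the true mean); the noise variance cancels out.
    return (c - true_mean) ** 2


def average_excess_risk(n_samples, trials=200, true_mean=1.0,
                        noise_std=1.0, seed=0):
    # Average the excess risk of the ERM solution over many training sets.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        data = [rng.gauss(true_mean, noise_std) for _ in range(n_samples)]
        total += excess_risk(erm_constant(data), true_mean)
    return total / trials


few = average_excess_risk(n_samples=5)
many = average_excess_risk(n_samples=500)
print(f"few-shot ERM excess risk: {few:.4f}, many-shot: {many:.4f}")
```

With only five samples the ERM solution sits, on average, much farther from the best function in the hypothesis space than with five hundred samples, mirroring the gap between h_f and h_s in Fig. 1 (b).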
FSCIL, which must overcome both of these challenges, is even more difficult. In addition, because of the large difference in the number of samples between old and new categories, the model tends to be biased towards the larger set of old-class training samples during training or prediction, and the imbalance between base and novel class samples also makes it difficult for the model to learn new categories (Chen and Lee, 2021; Hou et al., 2019a; Tao et al., 2020).
Although FSCIL has great potential in real-world applications and has gained significant attention from researchers, it remains a relatively underexplored area, with a lack of comprehensive reviews. Existing reviews primarily focus on either FSL or IL separately, rather than their combination in FSCIL. For example, Parisi et al. (2019) focus on continual lifelong learning, though much of the content may not reflect recent advancements. Wang et al. (2020b) introduced the theoretical foundation of FSL and classified FSL methods from different perspectives. Belouadah et al. (2021) provide a summary of Class-IL in visual tasks only. Zhou et al. (2023) summarized the latest progress in deep Class-IL from three aspects: data, model, and algorithm.
Our contributions to the field of FSCIL can be summarized as follows:
1) We conducted an in-depth analysis of fundamental and applied research of FSCIL. Our comprehensive review explores various FSCIL approaches, highlighting their advantages, limitations, and performance on benchmark datasets.
2) We revisited the theoretical foundations and practical implementations of various FSCIL approaches and proposed a taxonomy of methods based on the underlying approach or technique. This framework provides a useful guide for researchers and practitioners working on FSCIL.
3) We evaluated the performance of various FSCIL approaches on benchmark datasets, providing insights into the strengths and weaknesses of different methods.
4) We discussed the potential applications of FSCIL in various domains, such as computer vision, natural language processing, and graph analysis. This analysis highlights the broad range of applications for FSCIL and its potential impact on these fields.
5) We identified open research challenges and opportunities for future work in the field of FSCIL. This provides a roadmap for future research in the area and helps to guide the direction of future work.
The remainder of this paper is organized as follows. Section 2 introduces the problem definition of FSCIL and the relevant research background. Section 3 reviews the approaches and notable architectures used in FSL. Section 4 summarizes the existing FSCIL approaches, including traditional machine learning methods, meta learning-based methods, feature and feature space-based methods, replay-based methods, and dynamic network structure-based methods. Section 5 presents the performance of different FSCIL approaches on benchmark datasets. Section 6 discusses the applications of FSCIL in different domains. Section 7 outlines the future research directions in the FSCIL field. Finally, Section 8 concludes the paper.

Problem definition
In supervised learning, we want to learn a function $f \in \mathcal{F}: \mathcal{X} \to \mathcal{Y}$ that is able to predict the target vector $y \in \mathcal{Y}$ for a given input sample $x \in \mathcal{X}$. To do so, a model is fed training data with sufficient instances, $D = \{(x_i, y_i)\}_{i=1}^{N}$, which contains independent and identically distributed samples from the distribution $P(X, Y)$. To train this function $f$, we minimize the expected risk over the instance distribution $P$:
$$R(f) = \mathbb{E}_{(x, y) \sim P(X, Y)}\left[\ell\left(f(x), y\right)\right], \qquad (1)$$
where $\ell(\cdot, \cdot)$ captures the discrepancy between the prediction and the ground-truth label. However, the joint distribution $P$ is unknown; therefore, the learning algorithm actually aims at minimizing the empirical risk:
$$R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} \ell\left(f(x_i), y_i\right). \qquad (2)$$

Problem formalization
Fig. 2 shows how the dataset is split and how an FSCIL experiment is set up. An FSCIL task comprises a base session with sufficient training data and multiple incremental sessions with limited training data. The learning process within each session involves only the data relevant to the current task, while the model is also required to preserve the knowledge of previous tasks when acquiring new ones. The task is to train the model from a continuous data stream in a class-incremental form.
The FSCIL problem is defined as follows. Here we assume an m-step FSCIL task. Let $D^{(0)}_{train}, D^{(1)}_{train}, ..., D^{(m)}_{train}$ and $D^{(0)}_{test}, D^{(1)}_{test}, ..., D^{(m)}_{test}$ denote the training and testing data for sessions {0, 1, ..., m}, respectively. For session j, it has training data $D^{(j)}_{train}$. In Eq. 3, the learning algorithm f should build the new model based on the new dataset $D^{(j)}_{train}$ and the current old model $W_{j-1}$, and minimize the loss over all seen classes. During testing, the model is evaluated on all classes seen so far. For session j, its testing data $D^{(j)}_{test}$ has the corresponding label space of all classes seen in sessions 0 through j.

Relevant learning problems
Few-shot Learning. Humans are very skilled at identifying a new object from very few samples. For example, a child can recognize what a "zebra" or "rhinoceros" is with just a few pictures from a book. Inspired by humans' rapid learning ability, researchers hope that machine learning models can quickly learn new categories from only a small number of samples after learning from a large amount of data for a certain number of categories. This is the problem that FSL aims to solve. In recent years, the concept of FSL has received widespread attention, and many outstanding algorithms have emerged in the field of image classification (Finn et al., 2017; Snell et al., 2017; Zhang et al., 2018). There are mainly three categories of FSL methods: fine-tuning based, data augmentation based, and transfer learning based.
Considering a learning task T, FSL deals with a data set D = {D_train, D_test}. It consists of a training set D_train = {(x_i, y_i)}, i = 1, ..., I, where I is small, and a testing set D_test = {x_test}. Usually, one considers N-way K-shot classification, in which D_train contains I = KN examples from N classes, each with K examples. FSL is mainly a supervised learning problem (Wang et al., 2020b). Due to the small size of D_train, the model bias, ε = |ε_ex − ε_em|, is too large, making it hard to learn a high-quality prediction function f ∈ F : X → Y.
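A minimal sketch of how an N-way K-shot task could be drawn from a labeled pool is given below. The helper `sample_episode`, the extra held-out query samples per class (standing in for test samples of the task), and the toy label list are assumptions made for illustration.

```python
import random


def sample_episode(labels, n_way, k_shot, n_query, rng):
    # Group sample indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    # Pick N classes, then K support and n_query query indices per class.
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += picked[:k_shot]
        query += picked[k_shot:]
    return classes, support, query


labels = [i % 10 for i in range(200)]  # toy pool: 10 classes, 20 samples each
classes, support, query = sample_episode(
    labels, n_way=5, k_shot=1, n_query=3, rng=random.Random(0)
)
print(len(support), len(query))  # support holds I = K * N indices
```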
One-shot Learning. In the late 1980s and 1990s, some researchers had already noticed the problem of one-shot learning. It was not until 2003 that Fei-Fei et al. (2003) formally introduced the concept. They believed that when there is only one or a few labeled samples for a new category, the previously learned old categories can help predict the new category (Fei-Fei et al., 2006). In the N-way K-shot paradigm, FSL is called a one-shot learning problem when K = 1. Since the settings are similar, it is not necessary to distinguish between the two concepts in most cases.
Zero-shot Learning. In the N-way K-shot paradigm, FSL becomes a zero-shot learning (ZSL) problem when K = 0. ZSL was first introduced by Palatucci et al. (2009). Since ZSL does not contain examples with supervised information, it recognizes new sample categories by utilizing semantic label attribute information in the absence of training samples. This approach is inspired by human learning and reasoning capabilities, allowing computers to possess transfer and reasoning abilities. Specifically, the training data for ZSL is formulated as S = {(x, y, a(y)) | x ∈ X_S, y ∈ Y_S, a(y) ∈ A}, where X_S is the set of images/features from seen classes, Y_S is the set of seen class labels, and a(y) is the semantic embedding for class y. The test set is formulated analogously for the unseen classes.
Meta Learning. Meta learning is often understood as learning to learn. It is the process of extracting the experience of multiple learning episodes and using this experience to improve future learning performance (Hospedales et al., 2022). Meta learning is usually divided into two stages. In the meta-training stage, the model is trained using multiple source (or training) tasks to obtain initial network parameters with strong generalization ability. In the meta-testing stage, the settings of the new tasks are the same as those of the source tasks, but their samples have not been seen during the training process. Each task in the training tasks or testing tasks is divided into a support set and a query set. Meta learning has wide applications in the fields of computer vision, reinforcement learning, and architecture search. Meta learning is naturally suitable for FSL, and many studies have used meta-learning as a means of FSL, enabling the model to learn from a small number of new task samples (Elsken et al., 2020; Jamal and Qi, 2019; Ren et al., 2018).
Transfer Learning. Transfer learning (Zhuang et al., 2020) focuses on the transfer of knowledge across different domains, enabling the transfer of knowledge from domains/tasks with abundant training data to novel domains/tasks with scarce training data. Its definition is as follows.
Definition 1. Transfer learning. Given a source domain D_S with a corresponding task T_S and a target domain D_T with a corresponding task T_T, the primary aim of transfer learning is to leverage the knowledge obtained from D_S and T_S to enhance the learning performance on D_T and T_T, where D_S ≠ D_T or T_S ≠ T_T (Pan and Yang, 2010).
The key to successful knowledge transfer is the presence of a connection between the two learning activities. If there are few commonalities between domains, knowledge transfer may fail and have a negative impact on the new task. In everyday life, people engage in many instances of transfer learning, such as learning to ride a bike, which makes it easier to learn how to ride a motorcycle. Transfer learning can reduce the reliance on large amounts of target domain data when constructing learning machines. As a result, it has broad applications in zero-shot and few-shot domains, including style transfer, feature space transfer for data augmentation, and label-efficient learning of transferable representations across domains (Azadi et al., 2018; Liu et al., 2018; Luo et al., 2017).
Incremental Learning. The definition of IL can also be expressed using Eq. 3; the difference from FSCIL is that there are plenty of samples for each incremental category. IL, also known as continual learning, lifelong learning, or never-ending learning, is a field of machine learning that is gaining increasing attention. It is typically used to address the problem of catastrophic forgetting, where performance on previously learned tasks deteriorates sharply after learning new tasks. The goal of IL is to continuously process a stream of information from the real world while retaining, integrating, and optimizing old knowledge at the same time. IL methods are broadly grouped into three categories: replay-based methods, regularization-based methods, and parameter isolation methods (De Lange et al., 2021). Van de Ven and Tolias (2019) proposed three scenarios for IL: Task-IL, Domain-IL, and Class-IL. Class-IL is considered the most difficult one since the newly added classes often exhibit high similarity with the already learned classes. Currently, only replay-based methods produce acceptable results for Class-IL.

Variants of few-shot class incremental learning
Generalized few-shot incremental learning. Before the emergence of FSCIL, similar settings had been proposed in previous research (Gidaris and Komodakis, 2018; Qi et al., 2018; Xie et al., 2019; Yoon et al., 2020). These studies introduced Generalized Few-Shot Incremental Learning (GFSIL). Specifically, a pre-trained model learns new classes from limited instances. The goal of GFSIL is to maintain classification performance for both old and new classes. However, GFSIL has only one incremental phase, and its data partitioning format differs from that of FSCIL. For example, CIFAR-100 is randomly divided into 40, 10, and 50 categories, which serve as the meta-training, meta-validation, and meta-testing sets, respectively. GFSIL is considered less challenging than FSCIL. To address the challenge of GFSIL, Qi et al. (2018) propose initializing new class representations with the average feature of the few available shots. Meanwhile, Gidaris and Komodakis (2018) introduce dynamic few-shot learning to avoid forgetting, which employs a novel attention-based weight generator for few-shot classification. The dot-product calculation is replaced with the cosine-similarity function to incorporate the few-shot classification weight generator into the recognition system. Ren et al. (2019) propose an Attention Attractor Network to regulate the learning of novel classes. Additionally, Yoon et al. (2020) suggest a method for fusing base features, while Ye et al. (2021) put forward the idea of synthesizing few-shot classifiers with a shared neural dictionary. Xie et al. (2019) introduce Meta Module Generation (MetaMG), which uses meta-learning to learn a set of meta-modules, small neural networks that can be quickly adapted to new tasks. During the IL process, the MetaMG approach uses the learned meta-modules to generate task-specific modules for new classes.
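Two of the ideas above, average-feature initialization of a new class weight and a cosine-similarity classifier, can be sketched together. This is a simplified illustration in plain Python, not the implementation of Qi et al. (2018) or Gidaris and Komodakis (2018); the embeddings are assumed to come from some already trained feature extractor, and all names and toy values are hypothetical.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def imprint_weight(embeddings):
    # New-class weight: normalized average of the normalized few-shot embeddings.
    dim = len(embeddings[0])
    normed = [l2_normalize(e) for e in embeddings]
    mean = [sum(e[d] for e in normed) / len(normed) for d in range(dim)]
    return l2_normalize(mean)


def cosine_scores(embedding, weights):
    # Cosine similarity instead of a raw dot product, so differences in
    # weight norm between old and new classes do not affect the scores.
    z = l2_normalize(embedding)
    return [sum(a * b for a, b in zip(z, l2_normalize(w))) for w in weights]


base_weights = [[1.0, 0.0], [0.0, 1.0]]     # toy weights for two old classes
novel_shots = [[-0.9, -1.1], [-1.1, -0.9]]  # toy embeddings of one new class
weights = base_weights + [imprint_weight(novel_shots)]
scores = cosine_scores([-2.0, -2.1], weights)
print(scores.index(max(scores)))            # index of the predicted class
```

Because every weight vector and embedding is normalized, the new class is scored on the same scale as the base classes even though it was initialized from only two samples.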
Few-shot incremental learning. Similar to FSCIL, Ayub and Wagner (2020a) examine the problem of few-shot incremental learning (FSIL) and propose a cognitively-inspired approach. They represent each image class as a centroid. In the experimental setting of FSIL, the number of classes is the same for the base and incremental sessions, which differs from the rich base data setting in FSCIL. Additionally, to address the inability of ZSL to learn from data streams, Wei et al. (2020, 2021) have proposed the concept of incremental zero-shot learning (IZSL). Unlike traditional ZSL, IZSL involves multiple learning phases for new classes.
Incremental few-shot object detection. In the setting of incremental few-shot object detection (iFSD) (Perez-Rua et al., 2020), abundant base-class samples and a few novel-class samples are available. The model can use all the base-class samples for bootstrapping, as prior knowledge is required for the model to learn in the few-shot way. Once equipped with the prior knowledge of base-class data, the model cannot revisit base-class samples when learning novel classes. In other words, the model with prior knowledge should be able to learn from the few samples of unseen categories without relearning basic knowledge, which is aligned with practical application scenarios where the pre-trained model should be able to adapt to unseen information incrementally.
Despite many studies sharing similar settings to FSCIL, the current mainstream in academia still focuses on FSCIL. Therefore, this review primarily focuses on the more challenging FSCIL research.

Methods for few-shot learning
For FSL tasks, specialized network architectures or tricks are typically required to handle limited annotated data. In FSCIL research, many methods build upon advancements in FSL. In this section, we give a brief overview of commonly used network architectures in FSL, without discussing the novelty or effectiveness of the methods; they might not represent the latest research.
Numerous surveys have been conducted on the topic of FSL, proposing various classification approaches (Jadon, 2020; Song et al., 2023a; Wang et al., 2020b). One straightforward approach is to categorize FSL into four categories: data augmentation methods, metric-based methods, model-based methods, and optimization-based methods (Jadon, 2020). Hereafter, we provide a brief introduction to the commonly used network architectures within these four categories.

Data augmentation methods
In FSL, data augmentation is an important strategy. It alleviates the problem of data scarcity by increasing the diversity of existing data, rather than collecting new data. Data augmentation significantly reduces the risk of overfitting and effectively enhances the model's generalization ability. Data augmentation can be categorized by its source: transforming samples from the training set, transforming samples from a weakly labeled or unlabeled data set, or transforming samples from similar data sets (Wang et al., 2020b). Besides directly augmenting the data, one can also train a generative model, such as a VAE or a GAN, to generate new samples or features (Kong et al., 2022).

Metric-based methods
Metric-based methods classify objects in the embedding space by computing the similarity or distance between samples in the support set and the query set. For instance, by calculating the Euclidean distance between a test sample and each sample in the support set, the test sample is assigned to the category of the nearest support set sample. In FSL, commonly used metric learning methods include Siamese Network (Koch et al., 2015), Matching Network (Vinyals et al., 2016), and Prototypical Network (Snell et al., 2017). Fig. 3 illustrates the network structure differences among these three methods. These methods do not require extensive data but optimize the metric so that samples of the same class are close, while samples of different classes are distant.
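The nearest-support-sample rule mentioned above can be written in a few lines. The sketch assumes the embeddings have already been produced by some embedding network; the function names and toy vectors are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_nearest_support(query, support):
    # `support` is a list of (embedding, label) pairs; the query takes the
    # label of the closest support embedding.
    _, label = min(support, key=lambda pair: euclidean(query, pair[0]))
    return label


support = [([0.0, 0.0], "cat"), ([0.2, 0.1], "cat"), ([3.0, 3.0], "dog")]
print(classify_nearest_support([2.5, 2.8], support))
```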

Model-based methods
Model-based methods primarily refer to designing or using specific network architectures to address FSL challenges. For instance, Memory-Augmented Neural Networks (MANN) (Santoro et al., 2016) use external memory spaces to explicitly store class information, thus leveraging the long-term memory capabilities inherent in neural networks for FSL tasks. Meta Networks (Munkhdalai and Yu, 2017) learn meta-level knowledge across tasks and adjust their inductive biases through quick parameterization for swift generalization. These network structures efficiently utilize a limited number of labeled samples for rapid learning and adaptation.

Optimization-based methods
Optimization-based methods focus on adjusting the training strategy of models to adapt to situations with limited annotated data. This typically involves modifying the loss function, regularization terms, or the optimization algorithm itself to ensure that the models can quickly converge on few-shot data without overfitting. For example, Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) is a common optimization technique that quickly learns from limited new data. It trains the model's initial parameters across many tasks so that a few gradient steps on a new task yield good performance. Building on MAML, Reptile (Nichol and Schulman, 2018) reduces the computational cost by using only first-order gradients, which avoids the second-order derivatives required by MAML and increases computational speed.
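The Reptile update can be sketched on toy one-dimensional tasks. The quadratic task losses, the task centers, and the hyperparameters below are assumptions chosen for illustration; only the meta-update rule, which moves the initialization toward the parameters reached after a few inner gradient steps, follows the first-order scheme described above.

```python
def inner_sgd(theta, center, steps=3, lr=0.1):
    # The task loss is assumed to be 0.5 * (theta - center)^2, so its
    # gradient is (theta - center).
    phi = theta
    for _ in range(steps):
        phi -= lr * (phi - center)
    return phi


def reptile(task_centers, meta_steps=2000, meta_lr=0.1, theta=0.0):
    for step in range(meta_steps):
        center = task_centers[step % len(task_centers)]
        phi = inner_sgd(theta, center)
        # First-order meta-update: move the initialization toward the
        # task-adapted parameters; no second derivatives are computed.
        theta += meta_lr * (phi - theta)
    return theta


theta = reptile([1.0, 3.0])
print(round(theta, 3))
```

With two tasks whose optima lie at 1 and 3, the learned initialization settles close to 2, a point from which either task can be reached in a few gradient steps.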

Few-shot class-incremental learning: taxonomy
For fundamental research on FSCIL, there is currently no unified classification standard. Zou et al. (2022) divided FSCIL into metric-based and fine-tuning-based methods. The metric-based method is similar to the concept of FSL (Snell et al., 2017; Vinyals et al., 2016), and its key issue lies in the prototype representation and similarity metric. In FSCIL, fine-tuning-based approaches are widely used, and we refer to this method as Base Classes Pretraining and Novel Classes Fine-tuning (BPNF).
Definition 2. Base Classes Pretraining and Novel Classes Fine-tuning (BPNF) is a common approach used in FSCIL, which involves pre-training a model on data-rich base data and fine-tuning the model to better fit the novel classes in the incremental phase.This approach leverages the knowledge learned from the base classes to improve the model's performance on novel, unseen classes.

Supervised learning strategies
The capacity of a model that has been fine-tuned through an incremental process is limited by the amount of new class sample data available. To alleviate this constraint, certain studies have introduced additional semi-supervised or unsupervised data, rather than relying solely on labeled data, to refine the supervision method.
In Cui et al. (2021), semi-supervised learning was introduced to FSCIL and, based on the setting in Tao et al. (2020), 50 unlabeled samples were introduced in each incremental session. During training, the unlabeled data were combined with labeled data to enhance the performance of FSCIL. Ahmad et al. (2022a) proposed leveraging self-supervised learning to alleviate overfitting and catastrophic forgetting. Specifically, in addition to training a ResNet-18 model with base-class data, a deeper ResNet-50 network was trained using self-supervised methods on a large dataset. Both networks were then frozen to serve as two powerful feature extractors. The two sets of feature vectors were input into a Gaussian Generator to learn models for new classes while passing their features. Subsequently, through feature fusion followed by a classifier, forgetting can be effectively countered and adaptation to new classes can be achieved. Kalla and Biswas (2022) were the first to propose the self-supervised stochastic classifier (S3C) for FSCIL. The stochasticity of the classifier avoids overfitting to few-shot novel classes, while combining it with self-supervised training enables better preservation of base-class knowledge.

Statistical distribution
From the statistical distribution perspective, solving the FSCIL problem involves fitting models to existing datasets and predicting the data distribution of the classes, which offers excellent model interpretability. To address the limitations of common Gaussian process classification in tasks with a large number of classes, Achituve et al. (2021) proposed GP-Tree. GP-Tree is a tree-based hierarchical model that uses Polya-Gamma data augmentation to fit data to a Gaussian process, and it adapts well to the number of classes and the data size. Liu et al. (2022a) proposed the learnable distribution calibration (LDC) approach, which is rooted in a parameterized calibration unit (PCU). The PCU initializes the feature distribution of each class by using a Gaussian sampler, defined by the mean vector and the stored covariance matrix, to generate a set of feature samples. Specifically, the Gaussian sampler generates enough feature samples during IL to form biased distributions for old and new classes. The PCU cyclically updates the generated feature samples, thereby restoring the old class distribution and calibrating the new class distribution. Due to the fixed size of the covariance matrix, this method has low memory consumption. Both methods achieve good results in FSCIL, but the drawback is that the modeling process is complex.

Function optimization
Existing methods focus on overcoming catastrophic forgetting when learning new tasks, while Shi et al. (2021) analyzed this issue from the perspective of function optimization and found that flat local minima obtained during training on base classes have better generalization ability than sharp minima. Flat minima are a crucial concept in machine learning and optimization theory (Hochreiter and Schmidhuber, 1997). In the vicinity of a flat minimum, minor parameter alterations do not significantly impact the loss function, which makes the model robust. Furthermore, flat minima serve as a natural form of regularization, typically preventing models from overfitting and enhancing their generalization capabilities. Specifically, Shi et al. (2021) suggest searching for flat local minima of the base training objective function and then fine-tuning the model parameters within the flat region on new tasks, substantially reducing catastrophic forgetting.
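The notion of flatness can be made concrete with a simple perturbation test. The sketch below is not the procedure of Shi et al. (2021); it only measures, for two assumed one-dimensional quadratic losses, how much the loss rises on average when the minimizer is perturbed by small random noise. The function names and constants are hypothetical.

```python
import random


def sharpness(loss, w_star, radius=0.1, trials=1000, seed=0):
    # Average loss increase when the minimizer is perturbed by uniform
    # noise in [-radius, radius]; smaller values mean a flatter minimum.
    rng = random.Random(seed)
    base = loss(w_star)
    total = 0.0
    for _ in range(trials):
        total += loss(w_star + rng.uniform(-radius, radius)) - base
    return total / trials


def flat_loss(w):
    return 0.5 * 0.1 * w ** 2   # low curvature around w = 0


def sharp_loss(w):
    return 0.5 * 10.0 * w ** 2  # high curvature around w = 0


print(sharpness(flat_loss, 0.0), sharpness(sharp_loss, 0.0))
```

Both losses share the same minimizer, but the same parameter perturbations cost far more under the sharp loss, which is why fine-tuning inside a flat region disturbs base-class performance less.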

Meta learning-based methods
In the realm of FSL or IL, meta-learning can leverage existing knowledge to address current learning problems and improve the stability and reliability of the system through continuous knowledge accumulation. In FSL, meta-learning enhances the learning effect of the current task by utilizing data from other related tasks (Finn et al., 2017; Liu et al., 2020a; Rusu et al., 2019; Snell et al., 2017). In IL, meta-learning can be used to reduce dependence on new data, thereby avoiding overfitting (Riemer et al., 2019). It is natural to apply meta-learning to FSCIL.
Here, we divide the meta learning-based FSCIL methods into two categories: prototype learning-based methods and meta process-based methods.

Prototype learning
Prototype learning aims to identify a small set of exemplars that accurately represent a given dataset, and then use the similarity between the data points and the prototypes to classify new data points or complete other visual tasks. Commonly used class prototypes are defined as follows:
$$p_c = \frac{1}{|S_c|} \sum_{x_i \in S_c} f_\theta(x_i),$$
where $S_c$ is the set of all samples from class c and $f_\theta$ is the embedding network parameterized by θ. Compared to traditional supervised learning methods, prototype learning requires less labeled data and has stronger generalization ability. However, simply aggregating all learned class prototypes using traditional prototype-based methods may render some prototypes indistinguishable from one another. To address this problem, Zheng and Zhang (2021) introduced the class structure regularizer to regulate the distribution of the learned classes in the embedding space of FSCIL. By using class distribution as prior knowledge to regularize the learning of new classes, this approach ensures that classes from the same or different sessions are distinguishable from one another.
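A plain mean-embedding prototype classifier that grows across sessions can be sketched as follows. It does not include the class structure regularizer of Zheng and Zhang (2021); the class name `PrototypeClassifier`, the use of Euclidean distance, and the toy embeddings (assumed to be outputs of a fixed embedding network) are illustrative assumptions.

```python
import math


class PrototypeClassifier:
    """Keeps one mean-embedding prototype per class seen so far."""

    def __init__(self):
        self.prototypes = {}

    def add_session(self, embeddings, labels):
        # The prototype of a class is the mean of its sample embeddings.
        grouped = {}
        for e, y in zip(embeddings, labels):
            grouped.setdefault(y, []).append(e)
        for y, group in grouped.items():
            dim = len(group[0])
            self.prototypes[y] = [
                sum(e[d] for e in group) / len(group) for d in range(dim)
            ]

    def predict(self, embedding):
        # Assign the class whose prototype is closest in Euclidean distance.
        return min(
            self.prototypes,
            key=lambda y: math.dist(embedding, self.prototypes[y]),
        )


clf = PrototypeClassifier()
clf.add_session([[0.0, 0.0], [0.2, 0.0], [4.0, 4.0]], [0, 0, 1])  # base session
clf.add_session([[0.0, 5.0], [0.0, 5.4]], [2, 2])  # few-shot incremental session
print(clf.predict([0.1, 4.9]))
```

New classes are added without touching the stored prototypes of old classes, which is what makes this family of methods attractive for incremental sessions.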
Prototype-based methods from FSL face challenges in IL scenarios, primarily due to two issues: (i) With the increase in data volume, sample features or label distributions change because of potential concept drift or data distribution drift, so that prototype samples fail to accurately represent the latest data distribution; (ii) Classes introduced in later tasks might differ conceptually from earlier classes, causing conflicts within the prototype space, thereby affecting the efficacy of prototype distance measurement and consequently influencing classification accuracy. To address these issues, Zhu et al. (2021a) propose an incremental prototype learning scheme consisting of random episode selection and dynamic relation projection. Random episode selection improves the extensibility of the feature representation by adapting gradients to different simulated incremental processes generated randomly. Dynamic relation projection utilizes the relationship matrix between new class samples and old class prototypes to update existing prototypes.
Learning Vector Quantization (LVQ) is a prototype clustering method that selects vector points as prototypes based on distance as the clustering criterion.Chen and Lee (2021) uses a non-parametric method based on LVQ in deep embedding space.They compress the information of the learning task into a few quantized reference vectors.These include within-class variation, less forgetting regularization, and calibrated reference vectors to alleviate catastrophic forgetting.Based on the idea of the CIL algorithm, Mazumder et al. (2021) proposes few-shot lifelong learning (FSLL).This algorithm selects some parameters to update in each incremental session to resist overfitting.At the same time, it minimizes the cosine similarity between the new class prototypes and old class prototypes to maximize their separation, thereby improving classification performance.
According to Hersche et al. (2022), the input images are mapped to quasi-orthogonal prototypes from the perspective of hyperdimensional computing.The proposed C-FSIL comprises a frozen meta-learned feature extractor, a trainable fixedsize fully connected layer, and a rewritable dynamically growing memory.The three parameter update forms provided effectively balance accuracy and compute-memory cost.In Yao et al. (2022), a human cognition-inspired prototype representation enhancement scheme is proposed for FSCIL.This method uses prototype representations and iteratively learns the knowledge of novel classes by exploring similarity correlations with previously learned classes.Yang et al. (2023) argue that misalignment between the feature and classifier of old classes caused by fine-tuning the backbone or previous classifier prototypes is the reason for forgetting.Inspired by the neural collapse theory, they align a set of prototypes during neural collapse with prototypes required for FSL, which improves the classifier's performance.
The aforementioned methods exhibit conciseness in their algorithms, but the semantic gap between the few-shot class prototypes and the real data distribution is a major obstacle to improving the accuracy of prototype-based methods.

Meta process
Inspired by the multi-task optimization method MAXL (Liu et al., 2019), Chi et al. (2022) proposed MetaFSCIL, which directly transforms adapting to new knowledge and retaining old knowledge into a meta-objective. They mimicked the scenario during meta-testing by sampling a sequence of incremental tasks from base classes. Furthermore, they proposed a bi-directional guided modulation based on meta-learning to automatically adapt to new knowledge.

Drawing on metric learning within the context of meta-learning, Zou et al. (2022) discovered that using large margin classification improves the performance of the base classes but leads to a decrease in performance when learning new classes, a phenomenon termed class-level overfitting. The authors explain that this is due to the easily satisfied constraint of learning shared or class-specific patterns. Subsequently, they propose the boundary-based CLOM framework, which introduces an additional constraint that effectively addresses the aforementioned issue.

Feature decoupling, which entails dividing features into distinct representations, allows models to concentrate on more pertinent information. According to Zhao et al. (2021), the disentanglement of features results in low-frequency components playing a more significant role in preserving old knowledge. Specifically, they employed the discrete cosine transform to disentangle features and proposed a frequency-aware regularization method to enhance inter-space learning performance. Moreover, the proposed feature space composition operation further improves the inter-space learning performance.

Feature space
Subspace representations increase the efficiency of algorithms by mapping the original data to a low-dimensional space while preserving its useful features. Based on subspace representation, FSCIL methods project new-class data into the subspace composed of base or old-class features, thereby enabling the model to better adapt to new classes. In Cheraghian et al. (2021b), a mixture of subspaces is proposed to describe the visual and semantic domain distribution of the data, which helps to avoid forgetting old classes. Additionally, a variational autoencoder is utilized to generate synthesized visual samples that enhance the performance of pseudo-features and prevent overfitting during IL of new classes. In Akyürek et al. (2022), the authors propose a subspace regularization scheme that encourages the weight representation of new classes to be close to the subspace spanned by the weights of existing old classes. This regularization term is simple to apply and can incorporate more prior knowledge. From the perspective of parameter feature space, Kim et al. (2023) proposed WaRP by fusing the advantages of F2M (Shi et al., 2021) for finding flat minima of the loss function and FSLL (Mazumder et al., 2021) for parameter fine-tuning. They seek directions in the parameter space that are flat with respect to the loss function, and use singular value decomposition to represent the parameter space. In each incremental session, they fine-tune unimportant parameters in the parameter space to learn novel classes.
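The idea of keeping new-class weights close to the span of old-class weights can be illustrated with a small numerical sketch. It assumes the penalty is the plain squared distance between a new weight vector and its orthogonal projection onto that span; the exact regularizer used by Akyürek et al. (2022) may differ, and all function names here are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: build an orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        residual = list(v)
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * x for r, x in zip(residual, b)]
        norm = dot(residual, residual) ** 0.5
        if norm > tol:  # skip vectors that are numerically linearly dependent
            basis.append([r / norm for r in residual])
    return basis


def subspace_penalty(new_weight, old_weights):
    """Squared distance from new_weight to the subspace spanned by old_weights."""
    basis = orthonormal_basis(old_weights)
    projection = [0.0] * len(new_weight)
    for b in basis:
        coeff = dot(new_weight, b)
        projection = [p + coeff * x for p, x in zip(projection, b)]
    return sum((w - p) ** 2 for w, p in zip(new_weight, projection))


# Toy old-class weights (assumption) spanning the x-y plane of a 3-D weight space.
old_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
inside = subspace_penalty([3.0, -2.0, 0.0], old_weights)   # lies in the plane
outside = subspace_penalty([3.0, -2.0, 2.0], old_weights)  # 2 units off the plane
```

In training, such a term would be added to the classification loss with a weighting coefficient, so that a new class is free to move within the old-class subspace but pays a cost for leaving it.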
Recently, Song et al. (2023b) presented the concept of a fantasy space to enhance semantic knowledge. The core idea is to introduce placeholders for unseen classes within the fantasy space. These placeholders are derived from the original classes using discrete transformations. By learning to recognize and contrast in the fantasy space fostered by virtual classes, the method improves the separation of base classes and the generalization to novel classes.

Prospective Learning
Backward compatibility is an issue that requires special consideration in the process of software updates. It demands that newer versions of software be able to accept data from previous versions. Conversely, forward compatibility requires that older versions of software be able to accept data from newer versions. From this perspective, the ability of an FSCIL model to overcome forgetting represents its backward compatibility (Zhou et al., 2022a). This means that a model trained on a new session should not forget old class samples. Few studies have addressed the model's forward compatibility, which involves preparing for possible novel classes and updates during current training sessions. Here, we define:

Definition 3. Prospective learning refers to a method or technique in FSCIL where the model is trained on the base dataset to have forward compatibility, thus enabling the model to better handle incremental few-shot novel classes.

In order to enable the model to handle new classes, Zhou et al. (2022a) proposed forward compatible training (FACT), which allocates multiple virtual prototypes as a reserved space in the feature space to make the model scalable. FACT optimizes virtual prototypes to minimize intra-class distances and reserves more space for upcoming new classes. The model is made prospective through instance mixing to generate virtual instances. In subsequent research, Zhou et al. (2022b) proposed LIMIT, which creates fake FSCIL tasks from the base dataset and obtains generalizable features through meta-learning from different fake tasks to prepare the model for real FSCIL tasks. Additionally, an instance-specific embedding is generated by a transformer-based meta-calibration module to further improve performance. From the perspective of open-set recognition, Peng et al. (2022) linked FSCIL with open-set tasks to prepare the model for new classes. Specifically, instead of cross-entropy loss, they proposed using the angular penalty loss from face recognition to obtain well-clustered features. They combined class enhancement and data augmentation to improve the feature extractor's generalization ability for future incremental classes.

Replay-based methods
Based on the rehearsal technique, replay-based FSCIL approaches replay previously learned information for the task solver when presented with a new task. Replay-based methods employ an episodic memory $M$ to replay examples from previous tasks while updating the model with the current task $t$. There are two types: direct replay saves examples from old tasks to $M$, while generative replay uses a generative model to remember the data distribution of old tasks and to generate examples for $M$. When fine-tuning the model with data $D_t$, the loss function is computed jointly over $D_t$ and the examples replayed from $M$.

Direct replay
Kukleva et al. (2021) proposed a three-stage framework, wherein the first two stages train the network on base and novel classes separately and employ a model parameter constraint method to prevent forgetting of old classes. In the third stage, a small set of stored samples is used for replay and for calibrating the classifier's performance across all classes (both base and novel). IL methods based on knowledge distillation usually store a set of old class exemplars and add an additional distillation loss to transfer and preserve old knowledge (Castro et al., 2018; Hou et al., 2018; Rebuffi et al., 2017; Wu et al., 2019). However, due to class imbalance in few-shot scenarios and performance trade-offs between novel and base classes (Hou et al., 2019b), knowledge distillation is not the preferred method for FSCIL. Cheraghian et al. (2021a) proposed a semantic-aware knowledge distillation method that stores a small number of samples for the previous classes. By incorporating word embeddings as auxiliary information and mapping images to the vector space, they demonstrated the effectiveness of knowledge distillation for FSCIL. Unlike CIL based on individual knowledge distillation (Park et al., 2019), Dong et al. (2021) applied graph distillation techniques to FSCIL for the first time. They proposed a scheme for exemplar relation distillation incremental learning (ERDIL) based on graph relation knowledge distillation for knowledge extraction and representation. It effectively transfers old knowledge to the model for learning new tasks by maintaining a graph that represents the relationship between classes.
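The direct replay mechanism can be sketched as a small exemplar buffer that is merged with the current few-shot data before fine-tuning. This is an illustrative sketch only: the class name `EpisodicMemory`, the random exemplar selection, and the string placeholders for images are assumptions, and the surveyed methods use their own selection and calibration strategies.

```python
import random


class EpisodicMemory:
    """Stores up to `per_class` exemplars for every class seen so far."""

    def __init__(self, per_class, seed=0):
        self.per_class = per_class
        self.store = {}
        self.rng = random.Random(seed)

    def add_session(self, samples):
        """samples: list of (x, label) pairs from the session that just finished."""
        by_class = {}
        for x, label in samples:
            by_class.setdefault(label, []).append(x)
        for label, xs in by_class.items():
            # Random selection is an assumption; herding or other criteria are also possible.
            self.store[label] = self.rng.sample(xs, min(self.per_class, len(xs)))

    def replay_batch(self):
        """All stored exemplars as (x, label) pairs."""
        return [(x, label) for label, xs in self.store.items() for x in xs]


def training_set(current_data, memory):
    """Data used for fine-tuning in a session: current few-shot data plus replayed exemplars."""
    return list(current_data) + memory.replay_batch()


memory = EpisodicMemory(per_class=2)
base = [(f"img_{c}_{i}", c) for c in ("cat", "dog") for i in range(5)]
memory.add_session(base)
novel = [(f"img_fox_{i}", "fox") for i in range(3)]
mixed = training_set(novel, memory)
```

The buffer size per class directly controls the storage cost, which is one of the practical constraints of direct replay.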

Generative replay
In light of the privacy issues caused by storing real old data, Liu et al. (2022b) propose a data-free replay scheme for synthesizing old samples. By imposing entropy regularization, the generator is encouraged to produce uncertain examples that are closer to the decision boundary. Since the traditional generative replay paradigm in CIL cannot be applied to FSCIL, Agarwal et al. (2022) propose the few-shot incremental learning GAN (FSIL-GAN), which consists of a pre-trained feature extractor, a generator, a discriminator, and a semantic projection module. It addresses the problem of approximating the real data distribution with a small amount of data. They first match class-specific synthesized visual features with their respective latent semantic vectors, and then ensure the diversity and distinguishability of the synthetic features through an anti-mode-collapse regularizer. However, this method's performance cannot be guaranteed for multi-domain data.
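To make the link between entropy and proximity to the decision boundary concrete, the following sketch computes the entropy of a classifier's softmax output. It only illustrates the quantity that an entropy regularizer would reward; it does not reproduce the loss of Liu et al. (2022b), and the logits are made-up values.

```python
import math


def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over classes."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


confident = prediction_entropy([8.0, 0.0, 0.0])  # sample far from the decision boundary
uncertain = prediction_entropy([1.0, 1.0, 1.0])  # sample equally plausible under all classes
```

A generator trained with such a term would be pushed toward samples like the second one, whose predictions are spread across classes.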

Dynamic network structure-based methods
Dynamic network structures (Chen et al., 2020; Sabour et al., 2017) enable automatic adjustment of the network architecture at runtime based on input data features, and thereby offer strong generalization capabilities and a reduced risk of overfitting. Due to their flexibility and robust scalability, dynamic architectures have been extensively researched for their applications in IL (Aljundi et al., 2017; Rosenfeld and Tsotsos, 2018; Rusu et al., 2016). Leveraging these advancements, researchers have recently applied dynamic network structures in the context of FSCIL. Depending on the initial network structure employed, these methods can be categorized into three distinct groups.

Neural gas network
Tao et al. (2020) proposed the TOPIC framework, which utilizes a neural gas (NG) network to learn the topological structure of the feature space formed by different categories for knowledge representation. The stability of the NG's topology is maintained to prevent forgetting of old categories. With the dynamic growth of the NG to accommodate new samples, the representation of few-shot new classes is improved. Fig. 5 (left) displays the stabilization and adaptation of TOPIC.

Graph attention network
The graph attention network can dynamically process different types of graph data and make dynamic decisions based on the importance of nodes and edges learned on the graph. Zhang et al. (2021) pointed out that decoupling the training process into embedding learning and classifier learning can effectively prevent knowledge forgetting in the backbone. They proposed the Continually Evolved Classifier (CEC), which first trains the backbone with base data to give the network strong feature extraction capabilities. Then, a graph attention network is introduced in the classifier layer to adapt to the changes of incremental tasks. With the arrival of incremental tasks, the nodes and weights of the graph model dynamically increase. Fig. 5 (middle) illustrates the continual evolution of the classifier.

Dynamic neural networks
Yang et al. (2021) proposed a learnable expansion-and-compression network (LEC-Net), which enhances the feature representation capability by selectively expanding the network nodes and reduces feature drift from a model regularization perspective. Furthermore, they introduced the dynamic support network (DSN) (Yang et al., 2022), which can adaptively expand the network. DSN leverages compressive network expansion to enrich the feature representation in each incremental task and dynamically adjusts the feature space by invoking the old class distribution. During each training session, DSN selectively expands the network nodes to enhance the feature representation capability for incremental classes. Then, it dynamically compresses and expands the network through node self-activation to pursue a compact feature representation, thereby alleviating overfitting. Fig. 5 (right) shows the expansion and compression of DSN.
In the latest study, Yoon et al. (2023) explore a masking-based method in the network structure. They utilize non-binary masks to construct soft-subnetworks from the original network, effectively balancing forgetting and overfitting. In the base class session, the soft-subnetwork parameters and weight scores are learned. In the incremental learning sessions, the minor parameters of the subnetwork are updated.

Methods summary
This section has reviewed recent advancements in FSCIL. The following critically examines the strengths and weaknesses of the various families.
Traditional machine learning methods offer promising research prospects. By carefully designing the supervised approach of the model, introducing additional data proves effective. Studying FSCIL from a statistical distribution or function optimization perspective enhances model interpretability. However, the complexity of statistical distribution modeling still presents difficulties.
Meta-learning-based methods aim to make machine learning models more flexible and adaptive. However, meta-learning typically assumes that all tasks come from the same or similar data distributions and depends heavily on the meta-training set. When incremental tasks have different distributions from the base classes, model performance can be affected.
Feature and feature space-based methods leverage the core idea of learning more robust and efficient feature representations. In particular, prospective learning methods are worth exploring for their natural capability in handling unseen samples.
Replay-based methods directly address catastrophic forgetting in FSCIL. However, direct replay faces constraints in storage space, sample selection, and privacy. In contrast, generative replay partially alleviates these issues and offers a more flexible approach. Nevertheless, the challenges of training complexity and subpar data quality persist in generative replay methods.
Dynamic network structure-based methods serve as vital solutions to FSCIL challenges. They adapt to continuously changing data streams by adjusting model structures or inter-class relationships, thereby learning new knowledge while retaining old knowledge. Dynamic networks have gained traction in IL (Wang et al., 2022a,c), and exploring their application in FSCIL is encouraged.
Overall, developing methods that harmoniously balance performance, scalability, efficiency, and complexity remains an open research challenge.

Model performance
In this section, we present the performance of typical FSCIL methods on three different datasets. First, we outline the methodology for model selection, followed by an introduction to the classical datasets and evaluation metrics. Finally, we summarize the performance results of the various models.

Model selection
Comparing the performance of different methods is necessary, but currently the code for many of these methods is not publicly available. As most studies follow the standards set forth by Tao et al. (2020) (see Section 2.1), it is feasible to use the data reported in the original papers of the methods being compared, and we have adhered to this principle. Thus, the results reported in this section are based on the data reported in the original papers or on data processed from them. We have selected and compared the performance of 22 methods from five different families.

Fig. 5. Left: TOPIC (Tao et al., 2020) uses loss constraints for topology updates. Middle: To make the classifier suitable for all categories, CEC (Zhang et al., 2021) applies graph models to the classifier. As new tasks emerge and categories increase, the classifier's topology continuously evolves. Right: When training on new classes, DSN (Yang et al., 2022) temporarily expands network nodes to learn new class features, and then compresses redundant nodes to provide a compact feature representation.
CIFAR-100 contains 100 classes with 600 RGB images per class, where each class has 500 training images and 100 testing images. The size of each image is 32 × 32 pixels.
MiniImageNet contains 60,000 RGB images of size 84 × 84 pixels from ImageNet-1k (Deng et al., 2009). It has the same number of classes and samples as CIFAR-100, but its content is more complex and valuable for FSCIL research.
CUB-200 is currently the most widely used benchmark image dataset for fine-grained classification and recognition research. The dataset has a total of 11,788 bird images covering 200 bird subclasses, of which the training set has 5,994 images and the test set has 5,794 images. Each image has a size of 224 × 224 pixels. It provides more sessions and incremental classes for comparing the sensitivity of different methods. The performance of the selected methods was evaluated on the three benchmark datasets mentioned above. For detailed dataset settings, refer to Table 1.

Metrics
Considering the scarcity of original data reported in the papers, we compare only the accuracy of each session, the average accuracy (AA) over all sessions, and the performance dropping rate (PD) (Zhang et al., 2021). PD measures the absolute accuracy drop in the last session with respect to the accuracy in the base session, defined as

$$\mathrm{PD} = A_0 - A_N,$$

where $A_0$ is the classification accuracy in the base session and $A_N$ is the accuracy in the last session.
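Both summary metrics follow directly from the list of per-session accuracies. The short sketch below computes them for a made-up accuracy curve; the numbers are hypothetical and do not correspond to any method in this survey.

```python
def average_accuracy(session_accuracies):
    """Mean accuracy over all sessions, base session included."""
    return sum(session_accuracies) / len(session_accuracies)


def performance_drop(session_accuracies):
    """Absolute drop from the base session (first entry) to the last session."""
    return session_accuracies[0] - session_accuracies[-1]


# Hypothetical per-session accuracies (%), base session first.
accs = [80.0, 75.0, 70.0, 65.0, 60.0]
aa = average_accuracy(accs)
drop = performance_drop(accs)
```

Note that PD depends only on the first and last entries, so two methods with very different intermediate sessions can share the same PD.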

Benchmark results
Average performance. Table 2 presents the performance of typical FSCIL methods on different datasets. In the comparative experiments, all methods utilized ResNet as the backbone. However, there were variations in the specific ResNet models used. These differences are detailed in the table. We observe substantial performance disparities among different methods for various datasets. For the small-sized CIFAR-100 dataset, NC-FSCIL (Yang et al., 2023) exhibits outstanding performance at 67.50%, outperforming other methods by a large margin. For the more challenging MiniImageNet dataset, FeSSSS (Ahmad et al., 2022a) utilizes self-supervised learning for data augmentation and achieves a performance of 68.24%, surpassing NC-FSCIL (Yang et al., 2023) while also exhibiting lower knowledge forgetting. For the fine-grained CUB-200 dataset, only DSN (Yang et al., 2022) achieves an AA above 70% (71.02%), demonstrating a better ability to capture the differences between categories.
Table note: The method name "VAE-based" is defined by us.
Performance comparison by session. The accuracy of each session during the incremental process of various models on the CUB-200 dataset is illustrated in the line chart in Fig. 6. The accuracy of the model on the base classes limits the accuracy improvement during the incremental phase. With the exception of some early methods (TOPIC (Tao et al., 2020), SPPR (Zhu et al., 2021b), VAE-based (Cheraghian et al., 2021b)), most methods have an accuracy of 70% to 80% on the base dataset, and few methods have an accuracy above 80% on the base dataset (F2M (Shi et al., 2021), DSN (Yang et al., 2022), NC-FSCIL (Yang et al., 2023)). As the earliest work, TOPIC (Tao et al., 2020) is no longer competitive in any session. F2M (Shi et al., 2021), based on function optimization, and DSN (Yang et al., 2022), based on dynamic neural networks, still demonstrate clear performance advantages.

Performance comparison of accuracy and inference speed
Due to the unavailability of source code for most methods, in this part we select only methods with publicly available code. We test the accuracy and inference speed of these methods on the CIFAR-100 dataset. All experiments are conducted 50 times on an NVIDIA TITAN V GPU with 12GB of memory, and the average values are reported as the final results. The experimental results are presented in Fig. 7. It is noticeable that NC-FSCIL (Yang et al., 2023), based on the neural collapse theory, leads in both accuracy and inference speed. While SAVC (Song et al., 2023b) and CEC (Zhang et al., 2021) methods exhibit lower accuracy, they benefit from reduced model complexity, achieving the fastest inference speeds.

Research on few-shot incremental learning applications
In Section 3, the focus lies on fundamental research in FSCIL. In this section, we primarily introduce research that concentrates on implementing FSCIL techniques to resolve practical predicaments. We do not distinguish between the variants of FSCIL, such as FSCIL and FSIL, but rather focus on their applications. FSCIL, originating from computer vision (CV), has since gained extensive usage in natural language processing (NLP) and graph learning as well. Further subdivisions can be observed in Table 3.
Few-shot incremental learning in computer vision

Applications in image classification
To address the increasing demand for classification in hyperspectral imaging, Bai et al. (2020) propose a linear programming IL classifier. In pedestrian attribute recognition for video surveillance, as the need for identifying new attributes increases, old models become inadequate. Based on the idea of meta-learning, Xiang et al. (2019) use an attribute prototype generator module and an attribute relationship module to generate novel classification weights from annotated data.
The FSCIL methods mentioned in Section 3 are mainly designed for general classification tasks and neglect the discrimination power of learned representations, making them unsuitable for fine-grained image tasks. Based on the idea of meta-learning, Wang et al. (2020a) propose the MetaSearch model to solve the few-shot incremental product search problem in shopping and checkout processes. MetaSearch extracts different features between various novel categories to perform incremental product search. The designed multi-pooling-based feature extractor can capture subtle differences between fine-grained product categories, thereby improving classification accuracy. To address the fine-grained vehicle recognition problem, a compact and separable feature learning method (CSFL) is proposed in Li and Huang (2022). CSFL first decouples the feature extractor from the classifier and uses metric learning to train the feature extractor. In the class-incremental stage, only the classifier is updated, and incremental LDA is introduced to learn intra-class compact and inter-class separable features, thereby giving the model fine-grained image recognition capabilities. For the even more challenging ultra-fine-grained visual categorization task, Pan et al. (2023) propose the use of self-supervised learning and knowledge distillation to enhance the feature extraction ability of the network backbone, achieving better performance on fine-grained datasets than classic FSCIL methods.

Table 3 (excerpt). FSCIL in graph
HAG-Meta (Tan et al., 2022): the first GFSCIL study, solved based on prospective learning. Datasets: Amazon-Clothing (McAuley et al., 2015), DBLP (Tang et al., 2008), Reddit (Hamilton et al., 2017). Metrics: Accuracy, PD, RPD.
Geometer (Lu et al., 2022): GFSCIL based on class prototype representation. Datasets: Cora-ML (Bojchevski and Günnemann, 2018), Cora-Full (Bojchevski and Günnemann, 2018), Flickr (Zeng et al., 2020).

Applications in object detection
... (2021) proposed a few-shot batch incremental road object detection method specifically designed for road objects. The DualFusion architecture they proposed consists of a Faster R-CNN used for base class detection, a novel class detection network, and a fusion network. When detecting each new class, only 10 annotated instances are used. The limitation of this method is that, although access to the base dataset is required only once, all novel few-shot data must be retained to permanently access novel class data. In the field of hot-rolled steel strip surface defects, Sun et al. (2022) propose a new knowledge distillation network called the dual knowledge align network. Following the BPNF guidelines, a knowledge distillation framework is designed for fine-tuning. They convert NEU-DET (Song and Yan, 2013) into an incremental few-shot dataset, and experiments show that the network performs well compared to other methods. Furthermore, the few-shot incremental object learning problem for robotic vision is highly valuable. Previous studies have explored the use of a small set of visual examples to incrementally train robots and enhance their recognition capabilities (Ayub and Wagner, 2020b). However, the few-shot incremental object learning problem for robotic vision remains unresolved (Ayub and Wagner, 2021).

Applications in image segmentation
Unlike image classification and object detection, image segmentation requires classification of each pixel, making it more challenging than the other two tasks. Instance segmentation, a subtask of image segmentation, is even more difficult than semantic segmentation, as it requires distinguishing boundaries between different instances, while semantic segmentation only requires distinguishing objects and background. In the following, we discuss some applications of FSCIL in semantic and instance segmentation. Cermelli et al. (2021) made the first attempt to solve incremental few-shot semantic segmentation. They proposed PIFS, which combines prototype learning with knowledge distillation. In the base stage, PIFS trains the network on base data to develop the capability of feature extraction. In the FSL stage, PIFS exploits prototypes to initialize the classifiers of new classes and fine-tunes the network to refine its feature representation. The subsequently added prototype-based distillation loss enables the model to avoid overfitting and forgetting. Shi et al. (2022) proposed the Embedding adaptive-update and Hyper-class representation Network (EHNet) for incremental few-shot learning. The category embedding describes exclusive semantic properties, and the hyper-class knowledge expresses class-shared semantic properties. The category embedding is stored in the memory pool and can be updated adaptively. Subsequently, in the segmentation stage, EHNet guides the query image to segment the corresponding category.
For the more challenging incremental few-shot instance segmentation, Ganea et al. (2021) introduced model-agnostic methods and proposed the first approach to solving this problem: iMTFA. It repurposes the Mask R-CNN network (He et al., 2017) to train feature extractors that generate discriminative embeddings for different instances. The average of those class embeddings is used as the representation for each class in the cosine similarity classifier. Thanks to the ability to predict localization and segmentation in a class-agnostic manner, adding new classes simply uses the representation of each class. When a new class appears, Nguyen and Todorovic (2022) fine-tune the Mask R-CNN that was pre-trained on base classes. Specifically, they use Bayesian learning to estimate the class-weight distribution to modify the classification head, and compute the uncertainty of prediction to modify the bounding-box head. This results in better performance than iMTFA on the COCO dataset. However, they do not explain why their estimation of the uncertainty of bounding-box localization surpasses a Gaussian-based uncertainty estimation (He et al., 2019).
Few-shot incremental learning in natural language processing
FSIL was first proposed in the computer vision field, but with its increasing influence, many studies have applied its ideas to natural language processing (NLP). For instance, in few-shot intent recognition for text data, Zhang et al. (2022) propose constructing an undirected, fully connected geometric structure based on the spatial distribution of selected samples in the embedding space. Subsequently, they apply a multi-source contrastive-based loss to prevent the forgetting of the base classes and avoid overfitting of the novel classes.
Qin and Joty (2022) define relation learning in few-shot and incremental scenarios as continual few-shot relation learning and propose a method based on embedding space regularization and data augmentation to solve this problem. Wang et al. (2022b) use the generation-replay method to solve FSCIL for named entity recognition, which generates synthetic data of old entity classes for distillation. Qin and Joty (2021) propose a unified framework for lifelong few-shot language learning, LFPT5, based on prompt tuning of T5. LFPT5 performs well on three different tasks: sequence labeling, text classification, and text generation, and is suitable for real-world applications.
In addition, FSIL has also been applied at the intersection of images and NLP. For example, in label-to-image translation, which uses deep learning algorithms to learn the mapping from semantic space to image space, Chen et al. (2022) propose an FSIL method that solves this task with semantically-adaptive filters and normalization.

Few-shot incremental learning in graph
Recent studies have applied FSCIL to graphs (Lu et al., 2022; Tan et al., 2022). To maintain consistency with existing literature, we refer to this as graph few-shot class-incremental learning (GFSCIL). One of the pioneering studies in this field is the HAG-Meta method proposed by Tan et al. (2022), which incorporates the previously mentioned prospective learning concept. HAG-Meta is based on the graph pseudo incremental learning paradigm and enables the model to learn new classes incrementally by cyclically adopting them from the base classes. Furthermore, it addresses class imbalance problems using hierarchical-attention-based modules. Lu et al. (2022) proposed Geometer to tackle GFSCIL problems. Geometer predicts the label of a node by identifying the nearest class prototype in the metric space and adjusts the attention-based prototypes by observing the geometric proximity, uniformity, and separability of novel classes. To mitigate catastrophic forgetting and unbalanced labeling issues, teacher-student knowledge distillation and biased sampling are also introduced. However, neither of these methods can handle dynamic graph structures.

Future works
In this section, we discuss three key directions for the further development of FSCIL, namely, (i) theories, (ii) FSCIL settings and (iii) applications.

Theories
To further advance the field of FSCIL, several key areas require attention in future research. Firstly, researchers should aim to enhance the efficiency of algorithms by considering both performance and complexity. While many studies have focused solely on improving performance, it is important to also take into account the resource requirements of these methods. Secondly, it is crucial to improve testing standards to more accurately evaluate performance across multiple tasks and on the base dataset. Although the average accuracy metric is widely used, it fails to account for the imbalance between base-class and novel-class data. Additionally, the performance dropping rate considers only the accuracy on the base and final tasks, without considering the accuracy of intermediate sessions. In comparison, relative performance dropping rate (Tan et al., 2022) and harmonic accuracy (Peng et al., 2022) offer more comprehensive means of measuring model performance. Thirdly, as the ViT (Dosovitskiy et al., 2021) continues to gain importance, it may be worthwhile to explore its potential for use in FSCIL, as exemplified in Zhou et al. (2022b). By addressing these key areas, future research can build upon the current state of the art and continue to advance this important area of machine learning.
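To make the limitations of these metrics concrete, the following sketch computes average accuracy, performance dropping rate, and harmonic accuracy. The definitions used are the commonly adopted ones and are assumptions here rather than formulas quoted from the cited papers: the dropping rate is taken as first-session accuracy minus last-session accuracy, and harmonic accuracy as the harmonic mean of base-class and novel-class accuracy. The relative performance dropping rate is not included, and all numbers are made up for illustration.

```python
def average_accuracy(session_acc):
    """Mean accuracy over all sessions."""
    return sum(session_acc) / len(session_acc)


def performance_dropping_rate(session_acc):
    """Accuracy after the base session minus accuracy after the last session."""
    return session_acc[0] - session_acc[-1]


def harmonic_accuracy(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Two hypothetical runs with identical first and last sessions.
run_a = [80.0, 70.0, 60.0]
run_b = [80.0, 50.0, 60.0]
pd_a = performance_dropping_rate(run_a)
pd_b = performance_dropping_rate(run_b)

# A model that is strong on base classes but weak on novel classes.
hacc = harmonic_accuracy(80.0, 20.0)
```

Both runs obtain the same dropping rate even though the second one collapses in the intermediate session, and the harmonic accuracy of the last example is pulled toward the weaker novel-class accuracy, which a plain average dominated by base classes would hide.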

FSCIL settings
The current experimental guidelines for FSCIL largely follow the setting proposed in Tao et al. (2020), which assumes a fixed number of new classes and samples per class in each incremental phase. However, this assumption is difficult to satisfy in real-world applications. To address this issue, Ahmad et al. (2022b) extended FSCIL to a variable setting, where in each incremental session a learning agent can expect up to N ways and up to K shots. Additionally, Kalla and Biswas (2022) proposed a more general setting in which novel classes have different numbers of samples, known as FSCIL-imbalanced, and the number of base classes is not abundant, known as FSCIL-less base. Exploring approaches closer to real-world applications, such as how to handle variable numbers of new classes and shots in different sessions, has practical significance. The fusion of FSL with Task-IL and Domain-IL is also a promising research direction.
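As a rough illustration of the variable setting, the sketch below draws incremental sessions with at most N new classes each and at most K shots per class, while keeping the label spaces of different sessions disjoint. The function name, the uniform sampling, the per-class choice of shot count, and the class index range are all assumptions made for this example and do not reproduce the protocol of any cited work.

```python
import random


def sample_variable_sessions(novel_classes, num_sessions, max_ways, max_shots, seed=0):
    """Return a list of sessions, each a dict {class_id: number_of_shots}.

    Each session receives between 1 and max_ways unseen classes, and each
    class receives between 1 and max_shots samples. Classes are never reused.
    """
    rng = random.Random(seed)
    remaining = list(novel_classes)
    rng.shuffle(remaining)
    sessions = []
    for _ in range(num_sessions):
        if not remaining:
            break  # no unseen classes left to introduce
        ways = rng.randint(1, min(max_ways, len(remaining)))
        classes, remaining = remaining[:ways], remaining[ways:]
        sessions.append({c: rng.randint(1, max_shots) for c in classes})
    return sessions


# Hypothetical split: classes 60..99 are novel, spread over 8 sessions.
sessions = sample_variable_sessions(range(60, 100), num_sessions=8, max_ways=5, max_shots=5)
```

Fixing `ways` to `max_ways` and every shot count to `max_shots` would recover the conventional fixed N-way K-shot protocol, so the fixed setting can be viewed as a special case of this sampler.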

Applications
The application of FSCIL in various interdisciplinary fields is a promising avenue for future exploration. For instance, recent research has introduced FSIL into fields such as audio (Wang et al., 2021b), dynamic few-shot learning for multilabel audio classification (Gidaris and Komodakis, 2018), automatic radar modulation recognition (Luo et al., 2022), intrusion detection (Wang et al., 2021a), and medical time-series classification (Sun et al., 2023). However, these methods are limited to single-scene settings and therefore lack scalability. Establishing a unified theoretical framework that is applicable to a wide range of scenarios is thus one of the future directions for addressing complex and multimodal tasks.

Conclusion
Few-shot class-incremental learning is a challenging yet crucial task. It reflects how humans learn in real-world scenarios where high-quality data is often limited and learning data is continually presented. In this paper, we have provided a comprehensive survey of existing FSCIL approaches and categorized them into five families: traditional machine learning methods, meta-learning-based methods, feature and feature space-based methods, replay-based methods, and dynamic network structure-based methods. Integrating these methodologies to balance performance, scalability, efficiency, and complexity may provide a direction for future research. We have also discussed the performance of classic FSCIL methods and the applications of FSCIL in various fields of deep learning. However, FSCIL remains an underexplored area, and further research is required to explore its potential applications and theories. Due to space limitations, some theoretical derivations were not covered in detail. With the increasing demand for real-world AI applications, FSCIL research will continue to attract more attention and drive new innovations in the field of deep learning.

Figure 2: Dataset setting. Figure adapted from Zheng and Zhang (2021).
… $D^{(j)}_{\text{train}}$ with the corresponding label space $Y_j$. Training data from different sessions are disjoint, that is, $Y_a \cap Y_b = \emptyset$ ($a \neq b$). The limited instances in $D^{(j)}_{\text{train}}$ can be organized in an N-way K-shot data format, i.e., there are N classes in the dataset, and each class has K training images. Facing a new dataset $D^{(j)}_{\text{train}}$, a model should learn new classes and meanwhile maintain performance over old classes, i.e., minimize the expected risk $R$ over all the seen classes:

Figure 3: Common network architectures in metric-based methods: (a) Siamese Networks: Utilize twin subnetworks to extract features from two input samples and compute the distance between these features; (b) Matching Networks: By using the attention mechanism to dynamically match and aggregate the support set and query set examples, Matching Networks can generate class-related feature representations for query samples; (c) Prototypical Networks: Represent each class by the mean of its features. Thus, in the embedding space, closer features are more likely to belong to the same class.
Figure 4: Chronological overview of key FSCIL research developments.

Figure 5: During training, the network structure dynamically adjusts. Left: Sample features form the neural graph's topology. With new nodes added, TOPIC (Tao et al., 2020) uses loss constraints for topology updates. Middle: To make the classifier suitable for all categories, CEC (Zhang et al., 2021) applies graph models to the classifier. As new tasks emerge and categories increase, the classifier's topology continuously evolves. Right: When training on new classes, DSN (Yang et al., 2022) temporarily expands network nodes to learn new class features, and then compresses redundant nodes to provide a compact feature representation.

Figure 6: Accuracy curves of different methods on each session of CUB-200 dataset.

Figure 7: Performance comparison of various methods on CIFAR100: FPS vs. accuracy.

Table 1: Experimental setup for the three datasets.

Table 3: Summary of FSCIL applied research, including CV, NLP, and graph domains.
(… et al., 2022): Designing a class-conditional hypernetwork for incremental few-shot object detection; COCO, LVIS; AP
Incremental-DETR (Dong et al., 2022): DETR method based on fine-tuning and self-supervised learning; Pascal-VOC, COCO; AP
MCH, BPMCH (Feng et al., 2022): Analogous to the maintenance of new knowledge by establishing new connections
(… et al., 2021): Prototype-based incremental few-shot semantic segmentation
… 2017), a few-shot meta-learning algorithm that uses gradient descent to identify an appropriate initialization that can quickly adapt to the few samples of unseen classes. However, because the feature extractor overfits to base class samples, the output features generalize poorly, limiting the proposed model's performance on new classes. Yin et al. (2022) proposed a hypernetwork framework for iFSD called Sylph. It uses a base detector and hypernetwork architecture similar to ONCE. Unlike ONCE, they trained a base detector with class-agnostic localization capability on an abundant base dataset, thus decoupling localization from classification. This simplifies the task, but when the base dataset is small or of poor quality, the class-agnostic detector localizes poorly.