Concept Drift Adaptation Methods under the Deep Learning Framework: A Literature Review

Abstract: With the advent of the fourth industrial revolution, data-driven methods have become an integral part of decision making, and deep learning is one of the core technologies that support them. However, in the era of epidemics and big data, the volume of data has increased dramatically while its sources have become progressively more complex, making the data distribution highly susceptible to change. These conditions can easily lead to concept drift, which directly affects the effectiveness of prediction models. How to cope with such complex situations and make timely and accurate decisions from multiple perspectives is a challenging research issue. To address this challenge, we summarize concept drift adaptation methods under the deep learning framework, which helps decision makers make better decisions and analyze the causes of concept drift. First, we provide an overall introduction to concept drift, including its definition, causes, and types, and the process of concept drift adaptation methods under the deep learning framework. Second, we summarize concept drift adaptation methods in terms of discriminative learning, generative learning, hybrid learning, and others. For each aspect, we elaborate on the update modes, detection modes, and adapted drift types of concept drift adaptation methods. In addition, we briefly describe the characteristics and application fields of deep learning algorithms that use concept drift adaptation methods. Finally, we summarize common datasets and evaluation metrics and present future directions.


Introduction
The 2019 outbreak of coronavirus disease (COVID-19) has had distinct effects on people's health and quality of life, and there is great uncertainty regarding the outbreak's future evolution, duration, and scope. In this era of epidemics and big data, decision makers face a series of problems, such as extensive databases, rapid data growth, diversified data sources, and rapid changes in data distribution. At present, deep learning technologies can solve some of these problems and provide some guidance to decision makers [1]. However, deep learning does not adapt well to a changing environment. Once the environment changes, the new data no longer match the distribution of the old data. When such concept drift occurs [2], the deep learning model becomes obsolete and invalid. How to deal with this complex situation and make timely and accurate decisions from multiple perspectives is a challenging research problem. Concept drift adaptation methods offer a possible solution [3], helping decision makers find the optimal or most satisfactory solution in this dynamic and complex situation. These methods can continuously capture the potential danger of events by analyzing the data stream, deal with distribution changes in the data stream in a timely manner, and help decision makers update existing decisions to prevent decision-related losses.
The main contributions of this paper are as follows. (1) We review concept drift adaptation methods under the deep learning framework from four aspects (discriminative learning, generative learning, hybrid learning, and relevant others), so as to fill the gap left in this area by previous work. (2) We describe the general operating process of concept drift adaptation methods under deep learning frameworks and explain concept drift detection modes and update modes in detail. (3) We summarize the representative algorithms of each subcategory, common datasets, evaluation metrics, their application areas, and limitations. (4) We analyze and discuss the current problems of concept drift adaptation methods and point out future directions.
The rest of the paper is structured as follows. Section 2 provides the definition, causes, and types of concept drift and introduces the process of concept drift adaptation methods under the deep learning framework. Section 3 classifies concept drift adaptation methods based on deep learning and reviews the existing methods in the literature. Section 4 summarizes the common datasets and evaluation metrics. Section 5 discusses future research, and Section 6 concludes this paper.
Appl. Sci. 2023, 13, 6515

Overview of Concept Drift
In this section, we introduce the definition and causes of concept drift, the different types of concept drift, and the process of concept drift adaptation methods. Concept drift was first proposed by Schlimmer et al. [2] in 1986 and mainly refers to the fact that the underlying distribution of a data stream changes over time [18,19].

The Definition of Concept Drift
Assume that $P_{t_0}(x, y)$ represents the joint probability distribution of the input variable $x$ and the target variable $y$ at time $t_0$, and that $P_{t_1}(x, y)$ represents the joint probability distribution of $x$ and $y$ at time $t_1$. Concept drift occurs between $t_0$ and $t_1$ if Equation (1) holds:
$$\exists x : P_{t_0}(x, y) \neq P_{t_1}(x, y) \tag{1}$$
At this time, the underlying data distribution no longer conforms to concept $C_1$, and a new concept $C_2$ is generated. Because the joint probability factorizes as $P_t(x, y) = P_t(x) P_t(y|x)$, concept drift also occurs if Equation (2) is satisfied between $t_0$ and $t_1$:
$$\exists x : P_{t_0}(x) P_{t_0}(y|x) \neq P_{t_1}(x) P_{t_1}(y|x) \tag{2}$$
Changes in either $P_t(x)$ or $P_t(y|x)$ can therefore lead to concept drift.

The Causes of Concept Drift
According to the definition of concept drift and the characteristics of joint probability, concept drift can have the following three causes:
(1) Virtual concept drift. The probability of $x$ changes, but the probability of $y$ conditioned on $x$ does not, i.e., $P_{t_0}(x) \neq P_{t_1}(x)$ and $P_{t_0}(y|x) = P_{t_1}(y|x)$. This case is virtual concept drift, which does not affect the decision boundary and only changes the feature space.
(2) Real concept drift. The probability of $y$ conditioned on $x$ changes, while the probability of $x$ remains the same, i.e., $P_{t_0}(y|x) \neq P_{t_1}(y|x)$ and $P_{t_0}(x) = P_{t_1}(x)$. This case has a direct impact on the prediction model and is real concept drift, which changes not only the feature space but also the decision boundary.
(3) Hybrid concept drift. In an open environment, real and virtual concept drift can exist in the data stream at the same time, i.e., $P_{t_0}(x) \neq P_{t_1}(x)$ and $P_{t_0}(y|x) \neq P_{t_1}(y|x)$. This mixed concept drift is the most common case.
It is worth noting that, according to Bayesian decision theory [20], we obtain Equation (3):
$$P_t(y|x) = \frac{P_t(y)\, P_t(x|y)}{P_t(x)} \tag{3}$$
It can be seen that $P_t(y)$ and $P_t(x|y)$ also affect $P_t(y|x)$ and can thus indirectly cause real concept drift. The specific manifestations of concept drift due to different causes are shown in Figure 1, in which $(X_1, X_2)$ represents the two-dimensional feature space and $y$ represents the category label.
For example, in stock trading, stocks can be divided into profitable and unprofitable ones according to their profitability. When a user considers purchasing stocks, a change in the purchase channel or a small change in the number of purchases does not affect the trend of the stock. However, under the influence of an outbreak, the trend of a stock may change, which directly affects its returns. This situation is real concept drift, so users need to reconsider their decisions. In real life, virtual drift tends to have less impact on the outcome of a decision and causes no loss to decision makers. Real concept drift, however, tends to have a direct impact on decision outcomes because the relationships in the data change. It requires decision makers to discover the drift in time and revise their decisions to avoid losses.
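To make the distinction between virtual and real drift concrete, the following minimal sketch evaluates a fixed model on three synthetic streams. The noise-free labeling rule ($y = 1$ when $x > 0$), the Gaussian inputs, and all function names are assumptions chosen for illustration only; they do not come from the reviewed literature.

```python
import random

def make_stream(n, x_mean, flip_labels, seed):
    """Draw n samples (x, y) with x ~ N(x_mean, 1) and y = 1 if x > 0.

    flip_labels=True inverts the labeling rule, i.e., it changes P(y|x).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(x_mean, 1.0)
        y = int(x > 0)
        if flip_labels:
            y = 1 - y
        data.append((x, y))
    return data

def accuracy(model, data):
    """Fraction of samples for which the model predicts the true label."""
    return sum(model(x) == y for x, y in data) / len(data)

def old_model(x):
    """Fixed model learned on the old concept: predict 1 when x > 0."""
    return int(x > 0)

old_data = make_stream(1000, x_mean=0.0, flip_labels=False, seed=0)
virtual = make_stream(1000, x_mean=1.5, flip_labels=False, seed=1)  # only P(x) changes
real = make_stream(1000, x_mean=0.0, flip_labels=True, seed=2)      # only P(y|x) changes

acc_old = accuracy(old_model, old_data)
acc_virtual = accuracy(old_model, virtual)
acc_real = accuracy(old_model, real)
print(acc_old, acc_virtual, acc_real)
```

In this toy setting the old model remains accurate under virtual drift, because the decision boundary is unchanged, whereas under real drift its predictions no longer match the labels.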

The Types of Concept Drift
The changes in concept may manifest in different forms over time. At present, the most popular types of concept drift can be divided into abrupt drift, incremental drift, gradual drift, and recurring drift [14,16,17].
Abrupt drift refers to the rapid change of concept $C_1$ into concept $C_2$ within a short period of time; for example, if an earthquake suddenly occurs in a certain place, its economic model changes instantaneously, as shown in Figure 2a. Incremental drift refers to the slow, continuous transformation of concept $C_1$ into concept $C_2$, as when the economy gradually recovers after an earthquake, as shown in Figure 2b. Gradual drift refers to a transition period during which $C_1$ and $C_2$ repeatedly alternate before the stream eventually stabilizes at $C_2$, as when equipment ages, occasionally fails, and finally stops working, as shown in Figure 2c. Recurring drift refers to the reappearance of a previous concept after a period of time; for example, the sales of down jackets follow concept $C_1$ in the winter, follow concept $C_2$ in the off-season after the winter ends, and return to concept $C_1$ the next winter, as shown in Figure 2d. In addition, the speed of recurring drift can be abrupt, gradual, or incremental, and the recurrence can be periodic or irregular.
In academic research on concept drift, the types differ according to the classification criteria. However, it is common to divide the types of concept drift according to how the concept transforms, since this criterion is more intuitive. In related studies, different methods are suited to different types of concept drift. For example, the drift detection method (DDM) algorithm [21] is more suitable for abrupt drift.
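The four drift types can be illustrated with a small synthetic generator. The sketch below is only an illustration under simplifying assumptions: concept $C_1$ is encoded as the value 0.0 and concept $C_2$ as 1.0, and the function name `drift_stream` and its parameters (`start`, `width`, `period`) are hypothetical.

```python
import random

def drift_stream(kind, n=300, start=100, width=100, period=100, seed=0):
    """Return n values in [0, 1]; 0.0 encodes concept C1 and 1.0 encodes concept C2."""
    rng = random.Random(seed)
    values = []
    for t in range(n):
        # ramp is 0 before the drift starts and 1 once the transition is over
        ramp = min(max((t - start) / width, 0.0), 1.0)
        if kind == "abrupt":
            value = 1.0 if t >= start else 0.0             # instantaneous switch
        elif kind == "incremental":
            value = ramp                                   # passes through intermediate concepts
        elif kind == "gradual":
            value = 1.0 if rng.random() < ramp else 0.0    # C1 and C2 alternate, C2 wins
        elif kind == "recurring":
            value = float((t // period) % 2)               # C1, C2, C1, ...
        else:
            raise ValueError(f"unknown drift type: {kind}")
        values.append(value)
    return values

abrupt = drift_stream("abrupt")
incremental = drift_stream("incremental")
gradual = drift_stream("gradual")
recurring = drift_stream("recurring")
```

Here the recurring stream is periodic; an irregular recurrence would replace the fixed `period` with a random schedule.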
In addition to adapting to the four common types above, some methods distinguish real drift from virtual drift so that real drift is not confused with virtual concept drift, outliers, or noise. For example, RRBM-DD [22] explicitly considers how to identify real concept drift. Although a single concept drift adaptation method cannot handle all types of concept drift at once, it can usually handle several of them, so methods and drift types stand in a one-to-many relationship.
In recent years, many effective concept drift detection algorithms that can detect multiple drift types have also been proposed. Examples based on sliding windows include OCDD [23], CDT_MSW [24], and KSWIN [25]. OCDD uses two sliding windows to store new and old data and sends a drift signal based on the percentage of outliers that the classifier detects in the sliding window. It is more suitable for detecting abrupt and incremental drift, but it requires hyperparameter settings. CDT_MSW also uses two windows; the difference is that it can identify the position and length of a concept drift and thus accurately determine its type. KSWIN detects concept drift by applying the Kolmogorov-Smirnov test. These algorithms are based on supervised learning. Unsupervised concept drift algorithms include LD3 [26], STUDD [27], and CDCMS [28]. LD3 introduces label-dependent ordering for concept drift detection in multilabel classification and is more suitable for abrupt and incremental drift. STUDD creates an auxiliary model (the student) to mimic the behavior of the main model (the teacher), uses the teacher to predict new instances, and monitors the student's imitation loss to detect concept drift. It is more suitable for abrupt, gradual, and incremental drift. CDCMS uses novel clustering and diversity-based memory management strategies in the model space to deal with concept drift and performs well on abrupt and recurring drift. Finally, it is worth mentioning that most concept drift detection algorithms either occupy considerable memory or detect drift slowly. DMDDM [29], which is based on the Page-Hinkley test, effectively improves the detection speed and overcomes the limitations of cost and execution time, but it is only suitable for abrupt drift. Achieving a low-cost detection algorithm that covers all drift types remains a major challenge.
Therefore, we will also summarize the types of concept drift that each method adapts to.
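As an illustration of how a lightweight detector of this kind operates, the following sketch implements the plain Page-Hinkley test on a stream of prediction errors. It is not the DMDDM algorithm itself; the class name, the parameter values (`delta`, `threshold`, `min_samples`), and the synthetic error stream are assumptions made for the example.

```python
class PageHinkley:
    """Plain Page-Hinkley test for detecting an increase in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=10.0, min_samples=30):
        self.delta = delta              # tolerated magnitude of change
        self.threshold = threshold      # alarm threshold (often written lambda)
        self.min_samples = min_samples  # warm-up length before alarms are allowed
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0                  # cumulative deviation m_t
        self.cum_min = 0.0              # running minimum M_t of m_t

    def update(self, x):
        """Consume one observation; return True if drift is signaled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.n >= self.min_samples and (self.cum - self.cum_min) > self.threshold

# Synthetic error indicators: the model is always right, then always wrong.
errors = [0.0] * 200 + [1.0] * 100
detector = PageHinkley()
detected_at = next((t for t, e in enumerate(errors) if detector.update(e)), None)
print(detected_at)
```

The detector keeps only four scalars, which is why tests of this family are attractive when memory and execution time are limited; the price is that a slow change accumulates deviation slowly and is therefore detected later than a sudden one.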

The Process of Concept Drift Adaptation Methods under Deep Learning Framework
The general adaptation process for concept drift under the deep learning framework when dealing with non-stationary data streams is shown in Figure 3. First, the data stream input (single input or batch input) is used to train the deep learning model (single model or ensemble model) and obtain the basic prediction results. Next, if concept drift occurs during this process, a concept drift adaptation method is triggered to update the deep learning model so that it accommodates the drift and remains valid [16,19]. The concept drift adaptation method can be divided into two parts: concept drift detection and model update. Concept drift detection has an active and a passive mode, and model updates can be divided into structure updates and parameter updates. Active mode means that the learning process of the deep learning model contains a concept drift detection algorithm. When concept drift is detected, the concept drift adaptation method is triggered to update the model. Passive mode means that the method adjusts its model continuously as data arrive during the learning process. Instead of using a drift detection algorithm, it applies the concept drift adaptation method to update the model all the time. After the concept drift adaptation mechanism is triggered, the deep learning model is generally updated through a model parameter update or a structure update [30]. Model parameter updates can be divided into full parameter updates and partial parameter updates.
In particular, parameter updates also include parameter updates between ensemble models. Here, parameter updates are weight updates. Model structure updates, in turn, are generally performed by adjusting the width and depth of the network.
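The difference between the active and passive modes can be sketched with a deliberately tiny stand-in for a deep model. Everything in the sketch is an assumption for illustration: the "model" is a single running-mean parameter, the detector is a simple windowed error threshold, and the names `RunningMean`, `run_active`, and `run_passive` are hypothetical.

```python
from collections import deque

class RunningMean:
    """Toy model: predicts the mean of the targets seen since the last reset."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.n, self.value = 0, 0.0

    def predict(self):
        return self.value

    def update(self, y):
        self.n += 1
        self.value += (y - self.value) / self.n

def run_active(stream, window=10, threshold=1.0):
    """Active mode: a detector watches recent errors and triggers a full update."""
    model, errors, drifts = RunningMean(), deque(maxlen=window), []
    for t, y in enumerate(stream):
        errors.append(abs(y - model.predict()))
        if len(errors) == window and sum(errors) / window > threshold:
            drifts.append(t)
            model.reset()      # full parameter update: relearn from new data only
            errors.clear()
        model.update(y)
    return model, drifts

def run_passive(stream, lr=0.2):
    """Passive mode: no detector; every sample nudges the parameter."""
    value = 0.0
    for y in stream:
        value += lr * (y - value)
    return value

stream = [0.0] * 100 + [5.0] * 100   # abrupt drift at t = 100
active_model, drift_points = run_active(stream)
passive_value = run_passive(stream)
print(drift_points, active_model.predict(), passive_value)
```

The active variant pays for a detector but can discard obsolete knowledge at once, while the passive variant needs no detector and forgets the old concept at a rate fixed by its learning rate.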

Concept Drift Adaptation Methods under Deep Learning
In this section, we will summarize concept drift adaptation methods according to the classification of deep learning [31], including discriminative learning, generative learning, hybrid learning, and others, as shown in Figure 4. For each part, we will explain the update modes, types of drift adapted, and detection modes. In addition, we will also introduce the characteristics and application fields of deep learning techniques using concept drift adaptation methods.

Concept Drift Adaptation Methods Based on Discriminant Learning
This type of deep learning technique is used in supervised or classification applications by describing the posterior distributions of classes conditioned on the visible data. A discriminative model learns the relationship between input data and output labels and predicts output labels from the characteristics of the input data. In classification problems, the main purpose is to assign each input vector $a$ to a label $b$. Discriminant models attempt to directly learn the function $f(a)$ that maps input vectors to labels. The classifier first learns the posterior class probability $P(b = k|a)$ from the training data and then assigns a new sample $a$ to the class with the highest posterior probability, where $k$ stands for the class. The general process of discriminant concept drift adaptation methods is shown in Figure 5; among these methods, active detection and the parameter update mode account for a relatively large proportion. Discriminant learning mainly includes multilayer perceptrons (MLPs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), and their variants.
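The decision rule described above can be written in a few lines. The sketch assumes, for illustration, that the network outputs one raw score per class and that the posterior $P(b = k|a)$ is obtained with a softmax; the function names are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw class scores into posterior probabilities P(b = k | a)."""
    m = max(scores)                               # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(scores):
    """Assign the sample to the class with the highest posterior probability."""
    posterior = softmax(scores)
    k = max(range(len(posterior)), key=posterior.__getitem__)
    return k, posterior

label, posterior = predict_class([1.0, 3.0, 0.5])
print(label, posterior)
```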

MLP-based concept drift adaptation methods
MLP is a discriminant learning model widely adopted in decision making [32], and it is often combined with concept drift adaptation methods to solve classification problems for non-stationary streaming data. However, these concept drift adaptation methods are computationally expensive and converge slowly each time the model is updated because of the hyperparameter problem. Typical algorithms, such as the selective ensemble-based online adaptive deep neural network (SEOA), bilevel online deep learning (BODL), the neural network with dynamically evolved capacity (NADINE), and the Adadelta optimizer-based deep neural network with concept drift detection (CIDD-ADODNN), are elaborated as follows.
SEOA [33] uses a deep learning model with an $L$-layer MLP to form $L$ basic classifiers. It then dynamically adjusts the parameters of each basic classifier to handle concept drift and regularly selects base classifiers with different convergence and fitting abilities. It enhances the adaptability and generalization ability of the model to the data distribution and is more suitable for gradual, incremental, and recurring drift, although less suitable for high-dimensional non-linear problems. BODL [34] uses the MLP model for classification prediction and detects concept drift based on the classifier's error rate. When concept drift is detected, the model's parameters are updated through a bilevel optimization strategy and the exponential gradient descent method to adapt to abrupt concept drift; its limitation is that newly added classes cannot be identified online. By contrast, algorithms that update the model structure can converge more slowly. NADINE [35] uses a drift detection mechanism to detect concept drift actively. The mechanism adds an adaptive windowing strategy to the well-known Hoeffding's bound detection algorithm. When a drift signal is detected, the network structure is updated to adapt to concept drift, mainly through a hidden unit growing strategy and a hidden unit pruning strategy. The main advantage of NADINE over other algorithms is its elastic structure and online learning trait, but its training time is relatively long. It can be applied to classification and regression problems. Scholars have also studied concept drift adaptation for certain special kinds of data.
CIDD-ADODNN [36] is a classification model for highly imbalanced data streams. It uses the adaptive sliding window (ADWIN) drift detection algorithm to actively detect concept drift and then updates the network parameters through the Adadelta optimizer, so as to adapt to abrupt and gradual drift. This algorithm effectively improves the classification performance on highly imbalanced data streams, although its feature selection needs to be optimized.
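To illustrate what a structure update by growing and pruning hidden units looks like, the sketch below manipulates a one-hidden-layer network stored as plain lists. It is not NADINE's actual procedure: the zero initialization of the new output weight and the "smallest absolute output weight" pruning criterion are assumptions chosen so that the effect of each operation is easy to verify, and the class name `TinyMLP` is hypothetical.

```python
import random

class TinyMLP:
    """One hidden ReLU layer with a scalar output; weights are plain Python lists."""

    def __init__(self, n_in, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.n_in = n_in
        self.w_in = [[self.rng.uniform(-1, 1) for _ in range(n_in)]
                     for _ in range(n_hidden)]
        self.w_out = [self.rng.uniform(-1, 1) for _ in range(n_hidden)]

    def forward(self, x):
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in self.w_in]
        return sum(h * w for h, w in zip(hidden, self.w_out))

    def grow(self):
        """Add one hidden unit; a zero output weight leaves predictions unchanged."""
        self.w_in.append([self.rng.uniform(-1, 1) for _ in range(self.n_in)])
        self.w_out.append(0.0)

    def prune(self):
        """Remove the hidden unit with the smallest absolute output weight."""
        if len(self.w_out) <= 1:
            return
        idx = min(range(len(self.w_out)), key=lambda i: abs(self.w_out[i]))
        del self.w_in[idx]
        del self.w_out[idx]

net = TinyMLP(n_in=3, n_hidden=4)
x = [0.5, -1.0, 2.0]
before = net.forward(x)
net.grow()                      # width 4 -> 5 after a drift signal
after_grow = net.forward(x)
width_after_grow = len(net.w_out)
net.prune()                     # width 5 -> 4 when a unit contributes little
after_prune = net.forward(x)
```

In an actual method the new unit would subsequently be trained on post-drift data, which is the step that lets the wider network represent the new concept.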

RNN-based concept drift adaptation methods
Compared to other neural networks, RNN has certain advantages in processing sequence data because it has at least one feedback connection [37]. To some extent, it can alleviate the problem of concept drift. However, its capacity is also limited, especially when it comes to processing long sequences. It is mainly used in the fields of electricity load forecasting, weather forecasting, and anomaly detection. Typical algorithms are the online adaptive recurrent neural network (OARNN), the ONU-SHO-based RNN (ONU-SHO-RNN), the adaptive behavioral-based incremental batch learning malware variant detection model (AIBL-MVD), and the multilayer self-evolving recurrent neural network (MUSE-RNN).
OARNN [38] uses the RNN model to capture temporal correlations and tracks its own performance. When the performance deteriorates, the tree-structured Parzen estimator (TPE) is used to optimize the hyperparameters of the model online. Then, the weights of the RNN model are completely updated and relearned from new data to accommodate concept drift over short periods of time. It is mainly used for energy and electricity load forecasting, although it requires a large amount of data for training. In addition, Jagait et al. [39] proposed an online ARIMA-RNN ensemble based on OARNN, which belongs to hybrid learning and will be introduced later. ONU-SHO-RNN [40] decides whether to update the model by calculating the prediction accuracy of the RNN model and detecting concept drift on the data stream. It uses the ONU-SHO algorithm to perform a complete parameter update and narrow the error between the target output and the measured output. It converges quickly and adapts to incremental and gradual drift, although it suffers from update delays. AIBL-MVD [41] also adapts to incremental and gradual drift. It uses the statistical process control (SPC) algorithm to actively detect the occurrence of concept drift and updates all model weights through incremental learning. In this process, the catastrophic forgetting problem is addressed by mixing the new data with a subset of the old data. It is mainly used in the field of malware detection. Its limitation is that labeled malware samples must be available right before the model is updated. All of the above methods are based on the parameter update mode. By contrast, MUSE-RNN [42] uses structure updates, and it actively detects concept drift through Hoeffding's bound detection algorithm, which is also a common component of concept drift detection algorithms.
The model is updated by growing and pruning hidden nodes and layers for the real-time classification of data streams, although it does not handle image streams.
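Because Hoeffding's bound recurs in several of the detectors mentioned here, a minimal sketch of a two-window test built on it may be helpful. The bound $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$ is standard; the decision rule that compares the difference of the window means against the sum of the two bounds, as well as the function names and the value of $\delta$, are simplifying assumptions and do not reproduce the exact procedure of NADINE or MUSE-RNN.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Deviation epsilon such that the sample mean of n values with range
    value_range lies within epsilon of the true mean with probability 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def drift_by_hoeffding(old_window, new_window, value_range=1.0, delta=0.001):
    """Signal drift when the window means differ by more than both bounds combined."""
    mean_old = sum(old_window) / len(old_window)
    mean_new = sum(new_window) / len(new_window)
    eps = (hoeffding_bound(value_range, delta, len(old_window))
           + hoeffding_bound(value_range, delta, len(new_window)))
    return abs(mean_new - mean_old) > eps

# Windows of 0/1 error indicators: 10% errors before, 60% errors after.
old_errors = [0] * 90 + [1] * 10
new_errors = [0] * 40 + [1] * 60
drift = drift_by_hoeffding(old_errors, new_errors)
no_drift = drift_by_hoeffding(old_errors, old_errors)
print(drift, no_drift)
```

Larger windows shrink the bound and make the test more sensitive, which is one motivation for combining it with an adaptive windowing strategy.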

LSTM-based concept drift adaptation methods
Long short-term memory (LSTM) is a variant of RNN that solves problems such as vanishing gradients and is suitable for processing and forecasting important events with relatively long intervals and delays in time series [43]. LSTM-based concept drift adaptation methods are mainly used in the fields of anomaly detection, photovoltaic power generation prediction, and industrial prediction, and their typical algorithms include DL-CIBuild, I-LSTM, multi-objective metaheuristic optimization-based big data analytics with concept drift detection (MOMBD-CDD), adaptive LSTM (AD-LSTM), DCA-DNN, etc.
DL-CIBuild [44] is an LSTM-based algorithm for continuous integration (CI) build outcome prediction. It uses a genetic algorithm (GA) to tune the hyperparameters of the LSTM model, including the number of hidden layers and neurons. It does not require a very large dataset and is robust. However, it is relatively expensive in terms of labor because annotated datasets must be constructed. I-LSTM [5] combines a time factor with stratified sampling, so that newer data receive higher weights to accommodate concept drift, although balancing old and new data remains a problem. Overall, it improves multi-classification performance for anomaly detection, mainly in IoT applications. MOMBD-CDD [45] mainly deals with high-dimensional streaming data. It uses the statistical test of equal proportions (STEPD) to detect concept drift and combines the glowworm swarm optimization (GSO) algorithm to update the bidirectional long short-term memory (Bi-LSTM) model by adjusting weights. However, it is computationally intensive and consumes more resources. STEPD defines two windows, a recent window r and an overall window o; this two-window design is also common in deep-learning-based concept drift adaptation methods. It applies the statistical test of equal proportions to compare the accuracies of the two windows, as shown in Equations (4) and (5):

$$T(v_o, v_r, n_o, n_r) = \frac{\left| v_o/n_o - v_r/n_r \right| - 0.5\,(1/n_o + 1/n_r)}{\sqrt{\mu\,(1-\mu)\,(1/n_o + 1/n_r)}} \quad (4)$$

$$\mu = \frac{v_o + v_r}{n_o + n_r} \quad (5)$$

where v is the number of correct predictions and n is the number of samples in the corresponding window. The value of T is compared to the percentile of the standard normal distribution to obtain the observed significance level (p-value); the test is equivalent to the chi-square test with Yates's continuity correction. If p-value < α_d, STEPD signals a concept drift. If p-value < α_w, STEPD issues a warning that concept drift may occur. Here, α_d is the concept drift significance level, and α_w is the warning significance level.
Fog-DeepStream [46] uses the wavelet transform to reduce the dimensionality of the data and LSTM models to predict future behavior for data stream analysis in fog computing. It uses a drift detection algorithm to determine the occurrence of concept drift, and when a concept drift is detected, the parameters are updated to accommodate it. The method tries three drift detection algorithms: cumulative sum (CUSUM), Page-Hinkley, and exponentially weighted moving average (EWMA). However, this algorithm also takes up a lot of memory.
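The two-window STEPD test described for MOMBD-CDD can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in [45]: the function names, the one-sided p-value, the default significance levels, and the rule that only a drop in recent accuracy counts as drift are all choices made here for illustration.

```python
import math


def stepd_p_value(v_o, n_o, v_r, n_r):
    """p-value of the equal-proportions test with continuity correction.

    v_o, v_r: correct predictions in the overall and recent windows.
    n_o, n_r: number of samples in the overall and recent windows.
    """
    mu = (v_o + v_r) / (n_o + n_r)
    variance = mu * (1.0 - mu) * (1.0 / n_o + 1.0 / n_r)
    if variance == 0.0:
        # Both windows are all-correct or all-wrong: no evidence of change.
        return 1.0
    diff = abs(v_o / n_o - v_r / n_r) - 0.5 * (1.0 / n_o + 1.0 / n_r)
    t = diff / math.sqrt(variance)
    # Upper tail of the standard normal distribution (one-sided, assumed).
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def stepd_state(v_o, n_o, v_r, n_r, alpha_d=0.003, alpha_w=0.05):
    """Return 'drift', 'warning', or 'stable' (thresholds are assumptions)."""
    # Assumption: only a decrease in recent accuracy is treated as drift.
    if v_r / n_r >= v_o / n_o:
        return "stable"
    p = stepd_p_value(v_o, n_o, v_r, n_r)
    if p < alpha_d:
        return "drift"
    if p < alpha_w:
        return "warning"
    return "stable"
```

In a streaming setting, the counts would be refreshed after every prediction, and a "drift" result would trigger the model update step.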
The above algorithms are used in the Internet field. Next, we introduce algorithms in other fields. For example, AD-LSTM [47] is used for predicting photovoltaic power generation. It actively detects the occurrence of concept drift through the sliding window (SDWIN) algorithm and adopts the second stage of the two-phase adaptive learning strategy (TP-ALS) to fine-tune the prediction model. DCA-DNN [48] is mainly used for industrial prognosis and is based on the LSTM-FC model, which actively detects the occurrence of concept drift through the dendritic cell algorithm. It generates synthetic data using a kernel density estimator with drift-based bandwidth, which can be used to fine-tune the weights of the last layer to achieve faster adaptation and mitigate the problem of limited new samples. Both of the above algorithms suffer from model update delays, and their concept drift detection algorithms need to be optimized.

• CNN-based concept drift adaptation methods
CNN is a feed-forward neural network in which the neurons in a convolutional layer are not fully connected, and the weights and biases of connections between some neurons in the same layer are shared [49,50]. Consequently, the computational cost of CNN-based concept drift adaptation methods is relatively low. Typical algorithms, such as the evaluative convolutional neural network (ECNN) [51], mainly use a re-weighting operation to dynamically update the model, so as to solve the concept drift problem in high-throughput data. ECNN overcomes the "over-fitting" and "under-fitting" problems. ECNN is the first online deep learning technique to be introduced into marine data prediction research, although it is relatively computationally expensive. Online CNN-based model selection using performance gradient-based saliency maps (OS-PGSM) [52] is mainly applied to time-series prediction and uses a detection algorithm based on Hoeffding's bound to actively detect the occurrence of concept drift. When concept drift occurs, the region of competence (ROC) of the model is recalculated to update the weights. It has a low computational cost and uses saliency maps to explain model selection, but its hyperparameter settings need to be optimized. Deep incremental hashing (DIH) [53] focuses on semantic image retrieval using a CNN model. The parameters of the CNN are updated using a point-by-point loss function guided by the similarity of the current data block while keeping the target code. DIH mainly adapts to gradual and incremental drift. It also has certain limitations, such as not considering the semantic relationships between labels. Table 1 summarizes the discriminant-learning-based concept drift adaptation methods.
From this table, it can be seen that MLP-based concept drift adaptation methods focus on the processing of streaming data samples to balance old and new data and to handle imbalanced data, thus improving prediction accuracy and reducing errors. However, they have certain limitations in dealing with high dimensionality and are more suitable for gradual and abrupt concept drift. RNN-based algorithms and their variants have good timeliness and can handle long-term sequential data. However, they must overcome the problem of catastrophic forgetting and are more sensitive to incremental and gradual concept drift. It is worth noting that the types of concept drift adaptation are rarely clearly specified in related studies based on LSTM and CNN. Further, most concept drift adaptation methods face the problem of slow convergence.
Note: + represents active mode, − represents passive mode, √ represents parameter update, × represents structural update, "A" represents abrupt drift, "I" represents incremental drift, "G" represents gradual drift, "R" represents recurring drift, and "N" means not mentioned in the reference.
In addition to the above types of mainstream algorithms, there are some other methods of discriminant learning. For example, the OeSNN-DRT algorithm based on a spike network [54] introduces two methods: active and passive adaptation methods. It uses the data reduction technique (DRT), a selective and generative data reduction technique, to optimize the contents of the neuronal repository and update its structure. However, it does not take into account a priori information such as the speed and severity of the drift. Currently, there are few studies related to other types of concept drift adaptation methods compared to mainstream deep learning models, so they are not listed. However, it is a worthy direction for research.

Concept Drift Adaptation Methods Based on Generative Learning
Generative learning technologies are often used to describe higher-order correlation attributes or features for pattern analysis or synthesis, as well as joint statistical distributions of visible data and their related classes [55]. Most generative learning is unsupervised, but it can sometimes also be used for preprocessing in supervised learning, dimensionality reduction, etc. [56]. A generative model learns the data generation process, learns the probability distribution of the input data, and generates new data samples. More specifically, the generative model first estimates the class-conditional density P(a|b = k) and the prior class probability P(b = k) from the training data; that is, it tries to understand how the data for each class are generated. Bayes' theorem is then used to estimate the posterior class probability. Generative models can also learn the joint distribution of inputs and labels P(a, b) and then normalize it to obtain the posterior probabilities P(b = k|a). The general process of the concept drift adaptation method based on generative learning is shown in Figure 6; the parameter update mode accounts for a large proportion, and the proportions of active detection and passive adaptation are comparable. Common deep neural network technologies for unsupervised or generative learning are generative adversarial networks (GANs), autoencoders (AEs), restricted Boltzmann machines (RBMs), self-organizing mapping (SOM), and deep belief networks (DBNs) and their variants.
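The Bayes step described above can be made concrete with a small sketch. It assumes one-dimensional inputs and Gaussian class-conditional densities, which is a simplification chosen here for illustration; the function names are hypothetical and do not come from any cited method.

```python
import math


def gaussian_pdf(a, mean, std):
    """Density of a 1-D Gaussian, used here as the model for P(a | b = k)."""
    z = (a - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_generative(samples, labels):
    """Estimate the prior P(b = k) and Gaussian parameters of P(a | b = k)."""
    params = {}
    n = len(samples)
    for k in sorted(set(labels)):
        xs = [a for a, b in zip(samples, labels) if b == k]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        # Fall back to a tiny spread when a class has zero variance.
        std = math.sqrt(var) or 1e-6
        params[k] = (len(xs) / n, mean, std)
    return params


def posterior(a, params):
    """Bayes' theorem: P(b = k | a) is proportional to P(a | b = k) P(b = k)."""
    joint = {k: prior * gaussian_pdf(a, mean, std)
             for k, (prior, mean, std) in params.items()}
    total = sum(joint.values())  # the evidence P(a)
    return {k: value / total for k, value in joint.items()}


params = fit_generative([0.9, 1.0, 1.1, 2.9, 3.0, 3.1], [0, 0, 0, 1, 1, 1])
probabilities = posterior(1.2, params)
```

Because the model keeps an explicit estimate of the input distribution, a change in that estimate over time is one natural signal of concept drift.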

• AE-based concept drift adaptation methods
AE mainly consists of an encoder, a code, and a decoder [57]. It is combined with a concept drift adaptation method, which is mainly used for the anomaly detection of some high-dimensional data, such as the detection of the anomalous behavior of elderly people. Typical algorithms include the adaptive framework for online deep anomaly detection (ARCUS), unsupervised statistical concept drift detection (USCDD-AE), deep evolving denoising autoencoder (DEVDAN), and memory-based streaming anomaly detection (MemStream).
ARCUS [58] contains concept-driven inference and drift-aware model pool updates, where concept-driven inference evaluates the reliability of its models and assigns evaluation scores when a new data point arrives. When a concept drift occurs, the evaluation score drops, which triggers a model pool update. Some models are then removed and retrained to adapt to the concept drift. In this process, the algorithm uses AE models with the same structure to form a model pool that performs anomaly detection on the data stream; its main drawbacks are a large resource cost and the inability to store the current batch of data in which concept drift may occur. USCDD-AE [59] uses variational autoencoders to identify anomalies in the behavior of elderly people. It detects concept drift based on data from families and the Kullback-Leibler divergence between activity probability maps, defined as

$$D_{KL}(G \,\|\, Q) = \sum_{z \in Z} G(z) \log \frac{G(z)}{Q(z)}$$
where Z is the probability space, and G and Q are probability distributions defined over Z; here, G and Q are activity probability maps. When concept drift occurs, the encoder is updated to adapt to it by backpropagating the reconstruction error. In this process, data collection is often difficult, and false positives are possible. DEVDAN [60] is an incremental learning method that primarily uses the network significance formula to evaluate the predictive power of the model. Once the value given by this formula rises, its hidden nodes are adjusted. USCDD-AE and DEVDAN mainly follow the active concept drift adaptation approach but ignore catastrophic forgetting when adding new layers. MemStream [61] is used for anomaly detection in multidimensional data under concept drift. It first uses a small portion of the training set and extracts features using a denoising autoencoder. Then, when a new sample arrives, the anomaly score is recalculated, and the weighting factor of the AE is updated. If the anomaly score exceeds a user-set threshold, the memory is updated in a first in, first out (FIFO) manner, and the model is retrained to accommodate concept drift. This method effectively avoids noise disturbances and retrains quickly but has high resource overheads.
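The Kullback-Leibler comparison used by USCDD-AE can be illustrated with a short sketch over discrete distributions. It is not the procedure of [59]: the histogram representation, the direction of the divergence, the smoothing constant `eps`, and the threshold value are all assumptions made for this example.

```python
import math


def kl_divergence(g, q, eps=1e-12):
    """D_KL(G || Q) for two discrete distributions given as equal-length lists.

    Inputs are normalized first, so raw histogram counts are accepted.
    """
    if len(g) != len(q):
        raise ValueError("distributions must have the same number of bins")
    g_sum, q_sum = sum(g), sum(q)
    total = 0.0
    for g_i, q_i in zip(g, q):
        g_i, q_i = g_i / g_sum, q_i / q_sum
        if g_i > 0.0:
            # eps guards against division by zero when Q has an empty bin.
            total += g_i * math.log(g_i / max(q_i, eps))
    return total


def drift_detected(reference, current, threshold=0.1):
    """Flag drift when the current map diverges from the reference map.

    The threshold is a hypothetical value that would need tuning per dataset.
    """
    return kl_divergence(current, reference) > threshold
```

For example, `drift_detected([50, 50], [90, 10])` compares a balanced reference histogram with a skewed current one and reports drift, whereas two identical histograms give a divergence of zero.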
• GAN-based concept drift adaptation methods
GAN mainly consists of a generator and a discriminator. The former is used to create new data with characteristics similar to the original data, and the latter is used to determine the authenticity of the given data [62]. Compared to other deep learning techniques, there are few examples of GAN combined with concept drift adaptation methods; one is the distributed class-incremental learning method based on generative adversarial networks (DCIGAN). DCIGAN [63] uses a GAN generator to store information about past data and constantly updates the GAN parameters with new data. Meanwhile, a generative fusion (GF) method, which integrates multi-node local generators into a new global generator, is adopted. In particular, a method for monitoring and evaluating GAN during continuous learning is presented, which explains the concept drift [64]. Its main purpose is to solve the problem of classifying data streams, but different hyperparameters need to be set in different environments.
• RBM-based concept drift adaptation methods
RBM is made up of visible and hidden nodes, with every visible node connected to every hidden node and no connections within a layer, which facilitates the understanding of some irregular datasets. Moreover, it is sensitive to the occurrence of concept drift because it is able to learn the probability distribution of the input [65]. RBM-IM [66] and RRBM-DD [22] are two typical concept drift detection algorithms proposed by Korycki and Krawczyk, for multi-class imbalanced data streams and data streams subject to adversarial attacks, respectively. Both use gradient descent to update the weights in order to maintain the sensitivity of concept drift detection. RBM-IM is not suitable for small data streams and is prone to overfitting. RRBM-DD has limitations in identifying adversarial concept drift in dynamic classes of imbalanced data streams. In addition, the Gaussian restricted Boltzmann machine (GRBM) algorithm [67] primarily uses the Kullback-Leibler divergence to determine whether a concept drift has occurred, thus enabling the adaptive adjustment of the sliding window and the division of the data stream. It reduces energy consumption and saves memory but only makes judgments on data from a single source and does not adaptively divide heterogeneous data.
• SOM-based concept drift adaptation methods
SOM is often applied to create low-dimensional (usually two-dimensional) representations of high-dimensional datasets while maintaining the topology of the data [68]. The main benefit of using SOM is that it makes high-dimensional data easier to visualize and analyze for understanding patterns. As in the case of GAN, there are few examples of SOM combined with concept drift adaptation methods. An online unsupervised incremental method based on self-organizing maps (OUIM-SOM) [69] is used for multi-label stream classification in infinitely delayed labeling scenarios. It adopts the online update of neuronal weight vectors and dataset label cardinality to accommodate abrupt and incremental concept drifts. However, its adaptive effect on concept drift is limited.
Table 2 summarizes the typical algorithms based on generative learning. Among them, AE-based algorithms are mainly used for anomaly detection, and owing to the characteristics of autoencoders, these algorithms generally use unsupervised or semi-supervised learning, which can enhance the flexibility of data stream methods in utilizing unlabeled samples. Other combinations of generative learning models with concept drift algorithms, especially concept drift adaptation methods involving the deep belief network, have not been found, so they are not presented in this paper. However, some other generative learning models are involved. A self-organizing incremental neural network (SOINN+) for unsupervised learning from noisy data streams [70] adapts to concept drift by adding or removing nodes, creating or deleting edges, or combining both. SOINN+ is robust to noise and is able to find topological representations that are consistent with the distribution of real data. It is worth noting that the Euclidean distance used in the node similarity metric is not suitable for high-dimensional data.

Concept Drift Adaptation Methods Based on Hybrid Learning
Hybrid deep learning models usually consist of multiple underlying deep learning models, either a free combination of discriminative or generative learning or discriminative/generative learning plus other models, such as CNN + LSTM, GAN + CNN, CNN + SVM, and other algorithms, as shown in Table 3.
The generative model and the discriminant model each have their own advantages. The generative learning model can learn from unlabeled data and can save labor costs. The discriminant learning model outperforms the generative model in supervised tasks. Hybrid deep learning integrates discriminant or generative models according to the target task, and a framework that trains deep generative and discriminant models together can enjoy the advantages of both to solve real-world problems. The general process of the concept drift adaptation method based on hybrid learning is shown in Figure 7; the parameter update mode accounts for a large proportion, and the proportions of active detection and passive adaptation are comparable.
Typical algorithms combined with LSTM include HSN-LSTM, online autoregression with deep long short-term memory (OAR-DLSTM), CausalConvLSTM, and LSTMCNNcda. HSN-LSTM [71] is mainly used for multivariate time-series forecasting. It embeds a novel adaptive and hybrid spiking (AHS) module into LSTM to preserve the model's long-term prediction capability and alleviate its catastrophic forgetting problem. At the same time, in order to mitigate the impact of concept drift, it adopts the negative log-likelihood function in the fusion attention module to dynamically adjust the attention score and avoid noise interference. However, the resource costs are relatively high. OAR-DLSTM [72] combines a denoising autoencoder, an autoregressive model, and the deep long short-term memory (DLSTM) method, where the denoising autoencoder is mainly applied to feature extraction, and OAR and DLSTM are applied to target prediction. In the offline state, it divides the training data into data blocks and then pre-trains and retrains the DLSTM model with the error rate predicted by OAR in each data block to obtain several independent sub-models. In the online state, the results of the two models are weighted using maximum likelihood estimation to obtain the final time-series prediction output. When the dataset is too large, its performance degrades. B-Detection [73] is primarily used to detect runtime reliability anomalies in MEC services. It uses LSTM and AE models to capture the distribution characteristics of the normal reliability data stream. A weight-based reservoir sampling technique is then used to sample representative normal reliability data. Finally, the sampled data are used for detection model training, and the detection model is retrained to accommodate concept drift based on detection performance. However, the run time is relatively long.
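Weight-based reservoir sampling, as mentioned for B-Detection, can be sketched with the well-known Efraimidis-Spirakis scheme, in which each item receives a random key raised to the power 1/weight and the k largest keys are kept. Whether [73] uses this exact scheme is not stated in the text, so it is an assumption here, as are the function name and the meaning of the weights (for instance, recency).

```python
import heapq
import random


def weighted_reservoir_sample(stream, k, rng=None):
    """Keep k items from a stream of (item, weight) pairs in one pass.

    Items with larger weights are more likely to be retained.
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (key, arrival index, item)
    for index, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue  # non-positive weights can never be selected
        key = rng.random() ** (1.0 / weight)
        entry = (key, index, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            # Replace the current smallest key with the new, larger one.
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]


# Hypothetical usage: weight newer observations more heavily.
stream = [(i, 1.0 + i / 100.0) for i in range(100)]
sample = weighted_reservoir_sample(stream, 10, random.Random(0))
```

The reservoir uses memory proportional to k only, which is why this family of techniques suits unbounded data streams.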
A typical algorithm for the combination of LSTM and CNN is CausalConvLSTM [74], which utilizes CNN to extract spatial features efficiently and the LSTM model for prediction. It determines whether the model needs to be retrained based on the false-positive rate calculated over a rolling window and updates the network weights to accommodate concept drift using the backpropagation through time (BPTT) algorithm. CausalConvLSTM is primarily used for network intrusion detection but is limited in the types of logs it can handle. Another example is LSTMCNNcda [75] for time-series forecasting, which actively detects the occurrence of concept drift and updates the LSTMCNNnet model through an online parameter update when a concept drift is detected, although it places certain restrictions on the normalized time series and the choice of window size.
In addition, typical algorithms based on hybrid learning include the stacked autoencoder-deep neural network (SAE-DNN), OARIMA-RNN, and recurrent adaptive classifier ensemble (RACE). SAE-DNN [76] actively detects the occurrence of concept drift using STEPD. If a concept drift occurs, the top level of SAE-DNN is extended by means of a random vector functional link (RVFL). The parameters in the extension layer are dynamically assigned to new data through Lasso regularization and L2 regularization. However, there is a certain amount of noise interference. Adaptive online ensemble learning with RNN and ARIMA (OARIMA-RNN) [39] uses RNN models to capture temporal dependencies and implement online learning. It then dynamically adapts to concept drift by adding ARIMA to the ensemble and optimizing the RNN hyperparameters with each new batch. It has better accuracy than traditional offline models. However, concept drift and the performance during the drift are not quantified. RACE [77] is designed around handling recurring concepts and uses an MLP, a J48 decision tree, and a support vector machine as base learners to process training instances of time-series data. The training instances are then processed by the incremental learning algorithm, which is also used to detect the occurrence of concept drift. When concept drift occurs, the ensemble is updated and retrained. The algorithm requires a large amount of memory to run, and convergence slows down as the size of the ensemble increases.
From the summary of typical algorithms in Table 3, it can be seen that "LSTM" + "other models" is a common hybrid approach, which is mainly applicable to long-term streaming data and can overcome the forgetting problem and improve the accuracy to a certain extent. In summary, for hybrid learning methods, multi-model integration is mainly tuned using dynamic weighting between models, so it is essentially parameter updating. There are also embedded model combinations that are mainly applied in process industries, such as power forecasting.
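The dynamic weighting between models mentioned in the summary above can be illustrated with a small sketch. It uses an exponentiated squared-error update, a generic online-learning rule chosen here for illustration rather than the weighting scheme of any specific cited method; the class name and the learning rate `eta` are hypothetical.

```python
import math


class DynamicWeightedEnsemble:
    """Combine model outputs with weights that track recent accuracy."""

    def __init__(self, n_models, eta=0.5):
        self.weights = [1.0 / n_models] * n_models
        self.eta = eta  # larger values react faster to drift (assumed knob)

    def predict(self, predictions):
        """Weighted average of the individual model predictions."""
        return sum(w * p for w, p in zip(self.weights, predictions))

    def update(self, predictions, target):
        """Shrink the weight of each model according to its squared error."""
        scaled = [w * math.exp(-self.eta * (p - target) ** 2)
                  for w, p in zip(self.weights, predictions)]
        total = sum(scaled)
        self.weights = [s / total for s in scaled]


ensemble = DynamicWeightedEnsemble(n_models=2)
for _ in range(20):
    # Hypothetical stream: model 0 is exact, model 1 is off by one.
    ensemble.update(predictions=[3.0, 4.0], target=3.0)
```

After a drift, a model that starts to fit the new concept better regains weight through the same rule, with no structural change to the ensemble, which is why this kind of adaptation counts as a parameter update.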

Other Concept Drift Adaptation Methods
The deep learning framework classification mentioned above is mainly divided based on the perspective of single-class models or hybrid multiple models. It is worth noting that there are cases where other deep learning technologies [31,78], such as deep reinforcement learning and deep federated learning, are used. For example, deep reinforcement learning was introduced as a combination of deep learning (DL) and reinforcement learning (RL) to better cope with the dynamic changes of unstable environments, leveraging the primary deep learning models to generate the target models we need [79,80]. The general process of this concept drift adaptation method is shown in Figure 8; the structure update mode is more common in deep transfer learning, whereas parameter updating is more common in deep reinforcement learning. However, relatively few research studies involving concept drift adaptation methods are elaborated compared to other classes. The most popular are deep transfer learning (DTL) and deep reinforcement learning (DRL). Therefore, we mainly introduce DTL and DRL. Table 4 summarizes concept drift adaptation methods based on DTL and DRL.
• DTL-based concept drift adaptation methods
DTL mainly uses pre-training of deep learning models to obtain relevant knowledge. Then, by transferring the acquired knowledge to a new model, it can be adapted to a new task with minimal data and can save resources [81]. Currently, there are not many algorithms that involve concept drift. Typical methods are neural network patching (NN-Patching), adaptive mechanisms for learning CNNs (AM-CNNs), and autonomous transfer learning (ATL).
NN-Patching [82] passively handles concept drift using an error estimator. It constructs a discriminant classifier to identify the misclassified regions and then trains a new classifier (called a patch network) on the misclassified data. The patch network uses the intermediate layer of the original neural network to extract features and representations that are critical to classification. This method allows the original neural network to adapt quickly to concept drift, but its ability to handle concept drift is limited, and the hyperparameters need to be adjusted for each scenario. AM-CNNs [83] uses the nonparametric CUSUM test to actively detect the occurrence of concept drift. It relies on a "transfer learning" paradigm that transfers the knowledge of the CNN running before the concept drift to the CNN running after the concept drift, but the resource overhead is relatively high. ATL [84] uses the autonomous Gaussian mixture model (AGMM) to automatically adjust the network width, which addresses the concept drift problem. When a previously seen concept reappears, the model only needs to readapt to that concept. An adaptive anomaly detection approach toward concept drift (ADTCD) [85] is an adaptive anomaly detection model based on knowledge distillation and DTL. It transfers knowledge from the AE-based teacher model to the student model and updates only the student model, which dynamically adjusts model weights to accommodate concept drift primarily through local inference on new samples. However, the algorithm suffers from two limitations. First, the industrial scenarios used for the experiments are relatively homogeneous, and second, little attention is paid to scarce anomaly data.
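Active detection with a cumulative sum, as used in spirit by AM-CNNs, can be illustrated with a basic one-sided CUSUM over a stream of error values. This is the textbook form, not the nonparametric variant of [83]; the class name and the `slack` and `threshold` defaults are assumptions for this sketch.

```python
class CusumDetector:
    """One-sided CUSUM that reacts to an upward shift in a monitored value."""

    def __init__(self, target_mean, slack=0.05, threshold=1.0):
        self.target_mean = target_mean  # expected value before drift
        self.slack = slack              # tolerated deviation per sample
        self.threshold = threshold      # alarm level for the cumulative sum
        self.cumulative = 0.0

    def update(self, value):
        """Feed one observation; return True when drift is signalled."""
        deviation = value - self.target_mean - self.slack
        self.cumulative = max(0.0, self.cumulative + deviation)
        if self.cumulative > self.threshold:
            self.cumulative = 0.0  # restart monitoring after an alarm
            return True
        return False


detector = CusumDetector(target_mean=0.1)
# Hypothetical error stream: stable at 0.1, then shifted to 0.5.
errors = [0.1] * 50 + [0.5] * 10
alarms = [i for i, e in enumerate(errors) if detector.update(e)]
```

In a transfer-learning pipeline, the first alarm would be the point at which knowledge from the pre-drift network is carried over to a network trained on post-drift data.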
• DRL-based concept drift adaptation methods
DRL combines the perception ability of deep learning with the decision-making ability of reinforcement learning, so that control can be performed directly from the input information. Reinforcement learning defines the optimization goal, and deep learning provides the mechanism by which it is pursued (how to characterize problems and how to solve them) [86]. Algorithms using this kind of concept drift adaptation method are mainly applied in the fields of financial investment and anomaly detection. Typical algorithms include DeepPocket, RL4OASD, online ensemble aggregation using reinforcement learning (OEA-RL), and DeepBreath.
DeepPocket [87] is used in the field of financial investment. It uses a restricted stacked autoencoder to extract features and two convolutional networks to find the best portfolio through deep reinforcement learning. It then uses online training to dynamically update the weights to accommodate concept drift, but it does not lend itself to long-term investment strategies. RL4OASD [88] is mainly used for detecting abnormal vehicle trajectories. It includes two networks: one learns the features of the road network and trajectories, and the other detects anomalous sub-trajectories based on the learned features. The two networks can be trained iteratively without labeled data, and they employ an online learning strategy; that is, they are trained with newly recorded trajectory data and continuously update their strategies based on current traffic conditions, at the cost of a longer training time. OEA-RL [89] uses the deep reinforcement learning framework as a meta-learning method to learn linearly weighted ensembles and actively detects the occurrence of concept drift through the Page-Hinkley (PH) test. It then adapts to concept drift by updating its parameters. Because it relies on active detection, there is a certain delay in updating. DeepBreath [6] is also used for financial investment. It uses a restricted stacked autoencoder for dimensionality reduction and feature processing. Then, the SARSA algorithm and the online batch processing method are used to train a CNN to learn investment strategies from historical data; after training, the weights are updated through an online learning scheme to adapt to concept drift. To some extent, the algorithm lacks consideration of exogenous factors.
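As an illustration of the kind of active detection mentioned for OEA-RL, the following is a minimal sketch of a Page-Hinkley test that monitors a stream of error values for an upward shift in its mean. It is not the OEA-RL implementation; the class layout, the `delta` and `threshold` defaults, the helper `first_alarm`, and the synthetic stream are assumptions made for illustration.

```python
class PageHinkley:
    """Page-Hinkley test for detecting an increase in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm threshold (often written lambda)
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0      # cumulative deviation from the running mean
        self.min_cum = 0.0  # smallest cumulative deviation seen so far

    def update(self, x):
        """Consume one value and return True if drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


def first_alarm(stream, detector):
    """Return the index at which the detector first signals drift, or None."""
    for i, x in enumerate(stream):
        if detector.update(x):
            return i
    return None


# Hypothetical error stream: the error level jumps from 0.1 to 0.9 at index 200.
stream = [0.1] * 200 + [0.9] * 200
alarm_index = first_alarm(stream, PageHinkley())
print("drift signalled at index:", alarm_index)
```

The alarm is raised a few samples after the change point, which illustrates the update delay noted above: the test has to accumulate enough evidence before it signals drift.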
As can be seen in Table 4, the update mode of DRL is mainly a parameter update, since DRL interacts with the environment to learn its sequence of actions. The update mode of DTL is mainly a structural update. DTL can effectively use a small amount of data to train neural networks, and this characteristic allows structural updates to yield better predictive models. In addition, both generally use a combination of online and offline approaches to adapt to concept drift and support more complex predictions. Besides DTL and DRL, the two more popular deep learning methods, there are also concept drift adaptation methods based on other deep learning technologies. One example is FedHAR [90], a smart human activity recognition (HAR) framework based on deep federated learning. FedHAR designs an unsupervised gradient aggregation strategy that can overcome the problems of concept drift and convergence instability in online learning. The strategy aggregates the gradients of all labeled and unlabeled clients in federated learning and then drives the parameter update of the server model by averaging the aggregated gradients to adapt to concept drift.
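The server-side step of averaging client gradients can be sketched as follows. This is only a generic illustration of gradient averaging followed by a gradient descent step, under the assumption that every client reports a gradient vector of the same length; it does not reproduce FedHAR's unsupervised aggregation strategy, and the function names and learning rate are hypothetical.

```python
def average_gradients(client_grads):
    """Element-wise mean of the gradient vectors reported by all clients."""
    n_clients = len(client_grads)
    n_params = len(client_grads[0])
    return [sum(g[j] for g in client_grads) / n_clients for j in range(n_params)]


def server_update(weights, client_grads, lr=0.1):
    """Apply one gradient descent step to the server model using the averaged gradient."""
    avg = average_gradients(client_grads)
    return [w - lr * g for w, g in zip(weights, avg)]


# Hypothetical example: two clients (e.g., one labeled, one unlabeled),
# each reporting a gradient for a two-parameter server model.
weights = [1.0, 2.0]
client_grads = [[1.0, 0.0], [3.0, 2.0]]
new_weights = server_update(weights, client_grads)
print(new_weights)
```

Because the server model is refreshed from gradients computed on the clients' most recent data, repeating this step over successive rounds lets the server parameters follow a changing data distribution.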

Discussion
According to the summary of concept drift adaptation methods, hybrid learning and discriminative learning account for a relatively large proportion of the methods, and discriminative learning is especially widely used. This reflects the fact that labeled samples are beneficial for detecting changes in the data distribution. In addition, parameter updates also account for a large part. Compared with structural updates, parameter updates reduce the convergence time and adapt well to abrupt concept drift. Secondly, among the discriminative and generative learning algorithms, active detection also accounts for a considerable part. It helps explain the occurrence of concept drift and reduces the computing resources spent on training, but to a certain extent it requires additional memory and CPU resources. From this review, it can be seen that handling concept drift while reducing the amount of computation, saving resources, and speeding up convergence is the main challenge at present.
In addition, according to the above summary of drift adaptation types, there are usually more adaptation methods for abrupt, incremental, and gradual drift. Relatively speaking, abrupt drift occurs most frequently and changes the fastest, so most detection methods are sensitive to it, although problems such as update delay and high computational complexity remain. In contrast, recurring drift occurs the least often. In the case of recurring drift, previously learned models may become relevant again in the future, and online deep learning algorithms may have to relearn previous concepts. This process has a high computational burden because it means tuning or training a new model from scratch, which is also one of the main challenges at this time. Finally, it should be added that, besides deep learning algorithms, extreme learning machines (ELMs) are also neural-network-based models. In recent years, concept drift adaptation algorithms based on ELM have also become more common; the main ones include Meta-RKOS-ELM [91], SSOE-FP-ELM [92], and ONLAD [93], which makes this a research direction worth pursuing.

Performance Evaluation of Concept Drift
In this section, we summarize the common datasets and evaluation metrics. The datasets are divided into real datasets and synthetic datasets. For the former, we present their sources, learning tasks, and properties. For the latter, we show the drift types and characteristics they contain. After that, we describe the evaluation metrics and their meanings.

Datasets
Real datasets can effectively demonstrate the generality and applicability of an algorithm in the real world. Commonly used real datasets are KDD CUP 1999, Electricity, Weather, Spam, and CoverType. KDD CUP 1999 [94] is the dataset used in the KDD (knowledge discovery and data mining) competition. It is mainly used for network intrusion detection, that is, distinguishing normal network connections from malicious ones, and includes various attack data simulated in a military network environment. Electricity [95] is derived from the electricity market of New South Wales, Australia (1996 to 1998). It is mainly used to predict changes in electricity prices relative to the past 24 h, which are affected by factors including the weather, user demand, supply conditions, and seasons. Weather [96] contains daily weather measurement data for a certain area from 2006 to 2016, including temperature, humidity, wind direction, wind speed, visibility, atmospheric pressure, etc., and is used for predicting rainfall. Spam [97] is mainly used to identify spam. CoverType [94] is derived from the forest cover of a certain area in the U.S. Forest Service system.
Synthetic datasets contain various types of concept drift and can be used to evaluate the performance of algorithms under different concept drift situations. For abrupt concept drift, R.MNIST [98], P.MNIST [60], and SEA [99] can be used. SEA contains three features and two classes in each sample. R.MNIST and P.MNIST are generated from the MNIST dataset and contain 784 features and 10 classes. It is worth noting that P.MNIST also contains recurring drift. For gradual concept drift, Circles [100], Hyperplane [101], and LED [94] can be used. Circles contains two features and two classes in each sample. Hyperplane also contains incremental drift, with 10 features and 2 classes in each sample. LED also contains abrupt drift, with 24 features and 10 classes in each sample.
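To illustrate how a synthetic stream with abrupt drift is constructed, the following is a minimal sketch of an SEA-style generator with three features and two classes. It assumes the commonly described SEA construction, in which the label depends on whether the sum of the first two features is below a threshold and the threshold changes abruptly between concepts; the threshold values, the labeling convention, the absence of label noise, and the function name `sea_stream` are assumptions for illustration rather than the exact setup of [99].

```python
import random


def sea_stream(n_per_concept, thresholds=(8.0, 9.0, 7.0, 9.5), seed=0):
    """Yield (features, label, threshold) tuples with abrupt drift between concepts.

    Each sample has three features drawn uniformly from [0, 10]; only the first
    two influence the label, and the third is irrelevant.
    """
    rng = random.Random(seed)
    for theta in thresholds:  # each new threshold is a new concept
        for _ in range(n_per_concept):
            x = [rng.uniform(0.0, 10.0) for _ in range(3)]
            y = 1 if x[0] + x[1] <= theta else 0
            yield x, y, theta


# Example: 1000 samples per concept, so drift occurs at indices 1000, 2000, 3000.
samples = list(sea_stream(1000))
for start in range(0, len(samples), 1000):
    block = samples[start:start + 1000]
    positive_rate = sum(y for _, y, _ in block) / len(block)
    print(f"concept starting at {start}: positive rate = {positive_rate:.3f}")
```

The positive rate changes from block to block because the decision boundary moves, which is the kind of abrupt change in the input-label relationship that an adaptation method has to follow.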
In addition to the above commonly used datasets, there are some special datasets for deep learning frameworks. The Vxheaven dataset [102], commonly used in previous malware analysis studies, consists of Windows binaries belonging to malware and benign portable executables and contains different types of malware families, such as trojans, ransomware, and viruses. HAR-UCI [103] was made from recordings of 30 subjects performing activities of daily living. The STL-10 dataset [104] is an image recognition dataset for the development of unsupervised feature learning, deep learning, and self-taught learning algorithms. The Cat-Dog dataset [105] contains two classes, cats and dogs, with 12,500 images. The CIFAR100 dataset [106], which has 60,000 32 × 32 × 3 RGB images, is utilized to simulate distribution drift. Finally, some researchers have used their own collected data, as well as datasets from the application domain. For example, the I-LSTM and ECNN algorithms use self-collected data, and CausalLSTM uses the HDFS dataset [107] and the Cybersecurity's Intrusion Detection Evaluation dataset [108].

The Evaluation Metrics
For algorithms based on discriminative learning and hybrid learning, accuracy, recall, precision, F1-score, Matthews' correlation coefficient (MCC), and Cohen's kappa (κ) are mainly used for classification problems, and mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE) are mainly used for regression problems. For algorithms based on generative learning and others, the number of hidden nodes per time step (HN), the number of hidden layers per time step (HL), parameter count (PC), and execution time (ET) are mainly used. These are evaluation metrics specific to the deep learning framework. It is worth noting that MCC and Cohen's kappa are mainly used for unbalanced data. The definition of MCC is shown in Equation (7).
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \quad (7)$$
where TP represents true positives, TN represents true negatives, FP represents false positives, and FN represents false negatives. The definition of Cohen's kappa is shown in Equation (8).
$$\kappa = \frac{P_0 - P_e}{1 - P_e} \quad (8)$$
where $P_0$ is the success rate of the actual predictor and $P_e$ is the success rate of a random predictor.
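Both metrics can be computed directly from the confusion-matrix counts using their standard definitions. The following is a minimal sketch for the binary case; the function names, the convention of returning 0.0 when a metric is undefined, and the example counts are assumptions made for illustration.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews' correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


def cohen_kappa(tp, tn, fp, fn):
    """Cohen's kappa from binary confusion-matrix counts."""
    n = tp + tn + fp + fn
    p0 = (tp + tn) / n  # observed agreement (accuracy)
    # Chance agreement: product of marginal frequencies, summed over both classes.
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if pe == 1.0:
        return 0.0  # assumed convention when kappa is undefined
    return (p0 - pe) / (1 - pe)


# Hypothetical unbalanced example: 8 positives and 92 negatives.
tp, tn, fp, fn = 5, 90, 2, 3
print("accuracy:", (tp + tn) / (tp + tn + fp + fn))
print("MCC:", round(mcc(tp, tn, fp, fn), 4))
print("kappa:", round(cohen_kappa(tp, tn, fp, fn), 4))
```

In this example the accuracy is 0.95, while MCC and kappa are both close to 0.64, which shows why these two metrics are preferred for unbalanced data: they discount the agreement that a predictor favoring the majority class would obtain anyway.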
In addition to the above basic evaluation metrics, some researchers also use evaluation metrics specific to the application field of their algorithms. For example, the DeepPocket algorithm, which is used in the field of financial investment, mainly uses maximum drawdown (MDD), Sharpe ratio (Sr), and conditional value at risk (CVaR) to evaluate its performance.
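For readers less familiar with these financial metrics, the following is a minimal sketch of how they are commonly computed from a series of portfolio values or periodic returns. It uses textbook definitions (a non-annualized Sharpe ratio with sample standard deviation, and an empirical CVaR taken as the mean of the worst losses), which may differ in detail from the definitions used in the DeepPocket paper; all function names and example numbers are hypothetical.

```python
import statistics


def max_drawdown(values):
    """Largest relative drop from a running peak of the portfolio value."""
    peak = values[0]
    mdd = 0.0
    for v in values:
        peak = max(peak, v)
        mdd = max(mdd, (peak - v) / peak)
    return mdd


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation (not annualized)."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)


def cvar(returns, alpha=0.95):
    """Empirical CVaR: mean loss over the worst (1 - alpha) fraction of periods."""
    losses = sorted((-r for r in returns), reverse=True)
    k = max(1, int(round((1 - alpha) * len(losses))))
    return sum(losses[:k]) / k


# Hypothetical portfolio values and periodic returns.
values = [100, 120, 90, 110, 80, 130]
returns = [-0.10, -0.05, 0.01, 0.02] + [0.01] * 16
print("MDD:", round(max_drawdown(values), 4))
print("Sharpe ratio:", round(sharpe_ratio(returns), 4))
print("CVaR (95%):", round(cvar(returns), 4))
```

MDD and CVaR describe downside risk (lower is better), whereas the Sharpe ratio relates return to volatility (higher is better), so the three metrics are usually reported together.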

Future Directions
Based on the analysis and discussion of the above algorithms, we summarize the open problems that require further research, as described below.

Full Coverage of Concept Drift Types
According to the above-mentioned algorithms, such as ONU-SHO-RNN and DEVDAN, no single algorithm can adapt to all types of concept drift at once, and among the four types of concept drift, algorithms adapt best to abrupt drift. For some algorithms, such as ECNN and HSN-LSTM, the experimental sections do not indicate which drift types the datasets contain, and there are no experiments on their effectiveness under different types of concept drift. Therefore, it is necessary to improve the robustness and generalization of concept drift adaptation methods.

Data Processing Problem
Data processing has been a big challenge in deep learning and concept drift adaptation methods [4,21]. Firstly, when inputting samples, we may face problems such as class-imbalanced data and high-dimensional data. For example, in online anomaly detection, most of the datasets are very unbalanced, and the abnormal data account for a very small part [109]. Secondly, when the model is updated, we face the problem of how to balance new data against old data, as well as the problem that the new data samples may not be sufficient to support updating the deep learning model after concept drift occurs. These problems lead to poor prediction, slow model convergence, delayed model updates, and other consequences that are worthy of consideration and research.

Multi-Model Integration Problem
Our review shows that online ensemble methods, such as OARIMA-RNN, have become more popular among concept drift adaptation methods. Ensemble algorithms can effectively prevent overfitting and provide better prediction performance. However, their computational complexity is high, and they take up more resources, so how to optimize their performance is also a question worthy of deep consideration [110].

Visualization Problem of Concept Drift
At present, there is relatively little research on concept drift visualization. A representative visualization tool is DriftVis [111], which can help decision makers identify and correct concept drift in data streams. In fact, for many related fields, such as air quality monitoring and financial market analysis [7,112], explaining concept drift helps decision makers analyze problems comprehensively and make correct decisions. Finally, visualizing the types of concept drift is also a direction worth exploring.

Conclusions
In recent years, deep learning has become one of the core technologies of the fourth industrial revolution, and it has therefore also become one of the indispensable tools for assisting intelligent decision making. However, in the era of the epidemic and big data, the data distribution in streaming data can change very easily, which is a phenomenon known as concept drift. Once concept drift occurs, even the best-trained deep learning models become obsolete, producing poor predictions. Therefore, this paper summarizes concept drift adaptation methods under the deep learning framework. Firstly, we explain the definition and causes of concept drift. Then, we introduce the types of concept drift and the general process of a concept drift adaptation method under the deep learning framework. Next, we categorize the deep learning models that use concept drift adaptation methods into four aspects: discriminative learning, generative learning, hybrid learning, and others. For each aspect, we introduce in detail the update modes, detection modes, and adaptation drift types of concept drift adaptation methods. Finally, we summarize common datasets and evaluation metrics for concept drift adaptation methods and point out future directions. We hope that this paper can provide some academic help to researchers.

Conflicts of Interest:
The authors declare that there is no conflict of interest regarding the publication of this paper.

Abbreviations
The