A Classification and Novel Class Detection Algorithm for Concept Drift Data Stream Based on the Cohesiveness and Separation Index of Mahalanobis Distance

Introduction
In recent years, with the continued spread of the Internet and the development of the Internet of Things and data acquisition technology, data volumes have exploded. A constantly changing, time-stamped data model, the data stream, has emerged in the Internet, finance, medicine, and ecological monitoring. Since the advent of the Internet and wireless communication networks, the data stream, as a new type of data model, has attracted increasing attention [1,2]. The data stream has characteristics that distinguish it from traditional datasets: it is ordered in time, changes rapidly, is massive, and is potentially infinite. Precisely because of these characteristics, the processing model for data streams differs greatly from traditional data mining techniques. Traditional data mining operates on static datasets, which can be stored permanently on a medium and scanned multiple times during analysis.
Unlike a traditional static database, a data stream is updated at a faster rate and continuously flows into and out of the computer system. Accordingly, the two biggest challenges in processing a data stream are its inherently infinite length and the concept drift that occurs as the data change in real time. Concept drift means that the statistical properties of the target variable that the model attempts to predict change over time in unpredictable ways. It is therefore impractical to store all historical data and use it for training, as traditional data mining techniques would, which makes it necessary to modify existing techniques and design new mining algorithms for this data model. Novel class detection is a technique for detecting new categories in a data stream. Many traditional data stream classification algorithms train their classifiers with a fixed number of classes. In reality, however, outliers and novel classes appear in the stream over time, gradually degrading the accuracy of such algorithms. It is therefore urgent to design novel class detection algorithms suited to the characteristics of data streams. The rest of this paper is organized as follows: Section 2 introduces related research on data stream classification and novel class detection. Section 3 details the C&NCBM algorithm. Section 4 describes the experimental results on different datasets and provides a detailed analysis. Section 5 presents the conclusions as well as challenges and directions for future research.

Data Stream Classification in the Presence of Concept Drift.
The literature [3] reviews the various learning algorithms proposed in recent years in the context of concept drift. In 1986, Schlimmer and Granger [4] first proposed the notion of "concept drift," which subsequently attracted increasing attention from the academic community. From 1986 to 2000, research focused on using a single classifier to classify concept drift data streams; Widmer and Kubat proposed CBBIT [5], and Hulten et al. proposed methods such as FLORA [6]. At the same time, researchers began to study the theoretical problems of concept drift data stream classification.
Because a single classifier must be continuously updated to handle a concept drift data stream and its generalization ability is limited [7], Black and Hickey [8] first introduced integrated learning into concept drift data stream classification, proposing the AES algorithm. After about 2000, research therefore turned to integrated classifiers, and the study of concept drift data streams entered a period of rapid development, moving closer to realistic settings. Klinkenberg and Lanquillon were among the first to study concept drift with partial user feedback or with no feedback [8-11]. In 2004, the journal Intelligent Data Analysis published a special issue on concept drift data streams [12], which mainly discussed how incremental learning can let existing classifiers handle concept drift at small cost. Subsequently, more attention has been paid to issues such as class imbalance learning [13,14], recurring concept learning [15,16], semisupervised learning [17,18], and active learning [19,20] in the classification of concept drift data streams. Table 1 summarizes the three main types of concept drift data stream classification techniques from 2000 to 2016.

Novel Class Detection in the Presence of Concept Drift.
In the literature [33], Masud et al. proposed a novel class detection method for data streams with concept drift and infinite length, but the method does not address the problem of feature evolution. The literature [34] addresses the problem of concept evolution, but the methods of [33,34] still have too high a false alarm rate on some datasets and cannot distinguish between different novel classes. Masud et al. [35] proposed a method for handling the concept evolution caused by emerging novel classes. This method adds an auxiliary classifier set to the main classifier set. When an arriving instance in the data stream is judged to be an outlier by both the primary and the auxiliary classifier sets, it is temporarily stored in a buffer; when the buffer holds enough instances, the novel class detection module is invoked, and if a novel class is found, its instances are marked accordingly. In the literature [36], a feature space transformation technique is proposed to deal with the evolution of data stream features: a traditional data stream integrated classifier is combined with novel class detection to solve the feature evolution problem.
Chandak [37] proposed a string-based data stream processing method, which solves the concept evolution problem mainly through the CON_EVOLUTION algorithm. Miao et al. [38] addressed the limitation that the MineClass framework can only handle numerical data: they proposed a novel class detection algorithm that can process mixed-attribute data and used the VFDTc classifier to optimize the framework's processing time and model size. ZareMoodi et al. [39] used local patterns and neighbor graphs to solve the concept evolution problem in data streams. Local patterns are Boolean feature groups over sequential and categorical features, used to improve classification accuracy; among the candidate novel classes, neighbor graphs are used to analyze interrelated objects to improve the accuracy of novel class detection.
Through the continuous efforts of many researchers, novel class detection has achieved many results. However, most novel class algorithms cannot handle multiple novel classes appearing at the same time, nor do they consider the interaction of different attributes within an instance when identifying a novel class. Therefore, building on previous studies and taking the role of attributes into account, this paper proposes a novel class detection algorithm that can distinguish different categories of novel classes.

Cohesion and Separation Index Based on Mahalanobis Distance.
Based on the Mahalanobis distance [40] and the cohesion-separation index N-NSC proposed by Masud et al. [33], we propose a new novel class detection index. The relevant definitions are as follows.
Definition 1 (R-outlier) (see [33]). Let x be a test point and C_min the cluster closest to x. If x falls outside the region of feature space covered by C_min, then x is an R-outlier.
Definition 2 (F-outlier) (see [33]). If x is an R-outlier for every classifier E_i in the classifier set E, then x is an F-outlier.
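To make Definitions 1 and 2 concrete, the two outlier tests can be sketched in code. This is an illustrative sketch, not the authors' implementation: it assumes each classifier summarizes its training data as (center, radius) cluster pairs, and the helper names are ours.

```python
import numpy as np

def is_r_outlier(x, clusters):
    """Definition 1: x is an R-outlier if it falls outside the radius
    of the cluster nearest to it. clusters: list of (center, radius)."""
    x = np.asarray(x, dtype=float)
    center, radius = min(
        clusters, key=lambda c: np.linalg.norm(x - np.asarray(c[0], dtype=float)))
    return np.linalg.norm(x - np.asarray(center, dtype=float)) > radius

def is_f_outlier(x, per_classifier_clusters):
    """Definition 2: x is an F-outlier if it is an R-outlier for every
    classifier in the ensemble."""
    return all(is_r_outlier(x, clusters) for clusters in per_classifier_clusters)
```

A point well inside a cluster of some classifier fails the F-outlier test, while a point outside every classifier's clusters passes it.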
Definition 3 (λ_c-neighbor) (see [33]). The λ_c-neighbor of an F-outlier x is the set of the n neighbors of x closest to it in class c, denoted λ_c(x), where n is a user-set parameter. Based on the above definitions, we now define the cohesion and separation index MN-NSC based on the Mahalanobis distance.
Definition 4 (MN-NSC). Let ma(x) be the average Mahalanobis distance from an F-outlier x to λ_o(x), mb_e(x) the average Mahalanobis distance from x to λ_e(x), and mb_min(x) the minimum of the mb_e(x); then MN-NSC is defined as

MN-NSC(x) = (mb_min(x) − ma(x)) / max(mb_min(x), ma(x)),

where λ_o(x) denotes the λ_c-neighbor of x among the other F-outliers and λ_e(x) denotes the λ_c-neighbor of x in existing class e. By definition, the value of MN-NSC lies in the interval [−1, 1]. When MN-NSC is negative, x is closer to an existing class than to the F-outliers; when MN-NSC is positive, x is farther from the existing classes and closer to the F-outliers. When at least N (N > n) F-outliers have an MN-NSC value greater than 0, a novel class has appeared in the data stream.
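Definition 4 can be illustrated with a short sketch. This is our own illustration, not the authors' code; it assumes a single inverse covariance matrix shared by all Mahalanobis distance computations, and the function names are ours.

```python
import numpy as np

def mahalanobis(x, y, inv_cov):
    """Mahalanobis distance sqrt((x - y)^T S^{-1} (x - y))."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ inv_cov @ d))

def mn_nsc(x, f_outliers, existing_classes, inv_cov, n=3):
    """MN-NSC(x) = (mb_min(x) - ma(x)) / max(mb_min(x), ma(x)).
    f_outliers: the other F-outliers; existing_classes: label -> instances."""
    # ma(x): mean Mahalanobis distance to the n nearest other F-outliers
    d_out = sorted(mahalanobis(x, o, inv_cov) for o in f_outliers)
    ma = float(np.mean(d_out[:n]))
    # mb_e(x): mean distance to the n nearest members of each existing class e
    mb = [float(np.mean(sorted(mahalanobis(x, c, inv_cov) for c in insts)[:n]))
          for insts in existing_classes.values()]
    mb_min = min(mb)
    return (mb_min - ma) / max(mb_min, ma)
```

A point surrounded by other F-outliers and far from every existing class scores close to +1; a point inside an existing class scores negative.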

Algorithm.
This section elaborates the process of the classification and novel class detection algorithm based on the Mahalanobis distance cohesion-separation index and analyzes how concept drift in the data stream is handled.
First, the data stream is divided into data blocks of equal size; the most recently arrived block D_i, the current set M of the m best classifiers, the neighbor count n, and the novel class threshold β are the inputs of the algorithm. Then each instance in the block is classified and tested for being an R-outlier; if it is, it is added to the outlier set F. The instances in F are clustered with k-means, and a cluster point Fp_k, which stores the center and radius of its cluster, is created for each cluster; the MN-NSC value is then computed for each Fp_k. If the number of cluster points with a positive MN-NSC value exceeds the set threshold, the algorithm decides that a novel class has appeared and classifies it. Once all data in D_i are labeled, D_i is used to train a new model M_{m+1}, and the model M_i with the lowest classification accuracy in M is replaced by M_{m+1}. In this way, the classifier set always models the most recent concepts, which addresses the concept drift problem in the data stream (Algorithm 1). The pseudocode of the algorithm is shown below.

Table 1: The three main types of concept drift data stream classification techniques, 2000-2016.

Decision tree-based:
- OHT (2014): the misclassification rate is used to control node splitting, and concept drift is handled using the misclassification and false alarm rates [23].
- Hoeffding-ID (2016): Bayes' theorem is combined with traditional Hoeffding trees; new spanning trees continuously replace old ones during classification so that the classifier maintains high accuracy and adapts to concept drift [24].

Cluster-based:
- CluStream (2003): extends the traditional clustering algorithm BIRCH to the data stream scenario; flexible and scalable, but sensitive to outliers [25].
- DenStream (2006): uses microclusters to capture summary information about the stream; can find clusters of arbitrary shape and handle noisy objects [26].
- IEBC (2014): integrates a clustering framework with the classified data stream using sliding windows and data marking; excellent clustering results and drift detection, but can only process categorical data [27].
- MuDi-Stream (2016): solves the multidensity classification problem in concept drift data streams with a hybrid method based on grids and microclusters, but is unsuitable for high-dimensional streams [28].

Integrated learning:
- AWE (2003): a fixed number k of classifiers is maintained; a new classifier is trained in batch mode on newly arrived data, the k most accurate classifiers form the classifier set, and each is weighted by its accuracy [29].
- AE (2011): mainly addresses noise in data stream mining with a framework combining horizontal and vertical integration; time complexity is high [30].
- EM (2013): automatically detects concept drift and novel classes in the stream, but can only handle concept drift under dynamic feature sets [31].
- CLAM (2016): uses a class-based integrated classifier to efficiently classify recurring and novel classes, but cannot classify multiclass data [32].
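The block-by-block model replacement described in this section (train M_{m+1} on the newly labeled block D_i and replace the least accurate member of M) can be sketched as follows. SimpleKNN is an illustrative stand-in for the classifiers actually used in the paper, and the function names are ours.

```python
import numpy as np

class SimpleKNN:
    """Illustrative 1-NN stand-in for an ensemble member."""
    def fit(self, X, y):
        self.X = np.asarray(X, float)
        self.y = np.asarray(y)
        return self
    def predict(self, X):
        X = np.asarray(X, float)
        # index of the nearest stored training point for each query point
        idx = np.argmin(((X[:, None, :] - self.X[None, :, :]) ** 2).sum(-1), axis=1)
        return self.y[idx]
    def accuracy(self, X, y):
        return float(np.mean(self.predict(X) == np.asarray(y)))

def update_ensemble(ensemble, X_chunk, y_chunk):
    """Train M_{m+1} on the newly labeled block D_i and replace the
    ensemble member with the lowest accuracy on that block."""
    new_model = SimpleKNN().fit(X_chunk, y_chunk)
    worst = min(range(len(ensemble)),
                key=lambda i: ensemble[i].accuracy(X_chunk, y_chunk))
    ensemble[worst] = new_model
    return ensemble
```

Because the least accurate member is evaluated on the most recent block, members trained on outdated concepts are the first to be evicted, which is what keeps the ensemble aligned with the current concept.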

Experiment and Analysis
To verify the classification and novel class detection algorithm based on the Mahalanobis distance cohesion-separation index proposed in this paper, three sets of experiments were performed on two real datasets and one synthetic dataset. KNN (K-Nearest Neighbor) [41] was selected as the base data stream classifier of the C&NCBM algorithm to confirm the final predicted category of each instance. Since the essence of the proposed algorithm is based on KNN, the algorithm that classifies the data stream with KNN alone and MineClass [33] were chosen for comparison. The ArtificialCDS dataset is a random concept drift data stream automatically generated by MOA; it contains 5 classes with 100,000 instances in total, and each sample has 27 attribute dimensions.

Classification Accuracy.
This experiment uses the accuracy [42] and the evaluation time [33] of a classification algorithm to evaluate the quality of the different algorithms; these are widely used evaluation standards in the field. We expect a good classification algorithm to keep the evaluation time short while ensuring high classification accuracy.

Kappa Statistic.
The Kappa Statistic [43] is an indicator for assessing classification accuracy, defined as

Kappa = (p_o − p_e) / (1 − p_e),

where p_o is the observed agreement of the classifier, that is, the total number of correctly classified samples divided by the total number of samples, and p_e is the agreement expected from random classification.
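This statistic can be computed directly from the true and predicted labels; here p_e is obtained from the marginal class proportions of the two label sequences, and the function name is ours.

```python
import numpy as np

def kappa_statistic(y_true, y_pred):
    """Kappa = (p_o - p_e) / (1 - p_e), with p_o the observed agreement
    and p_e the agreement expected from random classification."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    # p_o: fraction of samples classified correctly
    p_o = float(np.mean(y_true == y_pred))
    # p_e: chance agreement from the marginal proportions of each class
    p_e = float(sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in labels))
    return (p_o - p_e) / (1.0 - p_e)
```

Kappa equals 1 for perfect agreement and 0 for a classifier that does no better than chance, which is why it is more informative than raw accuracy on imbalanced streams.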

Experimental Results and Analysis.
This section separately compares and verifies the classification performance of the proposed algorithm and its handling of concept drift, and analyzes the results.

Algorithm 1: C&NCBM on block D_i.
(1) for each instance x in D_i do
(2)   classify x with the classifier set M
(3)   if x is an R-outlier for every classifier in M then
(4)     add x to the set F
(5)   end if
(6) end for
(7) cluster F by k-means (k = n * |F| / |D_i|) and create a cluster point Fp_k for each cluster
(8) for each cluster in F do
(9)   compute MN-NSC(Fp_i)
(10)  if MN-NSC(Fp_i) > 0 then
(11)    count = count + 1
(12)  end if
(13) end for
(14) if count > β then
(15)   put all instances x belonging to the novel class in block D_i into class C
(16) end if
(17) train a new model M_{m+1} on D_i and replace the least accurate model in M
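The novel class detection core of the algorithm (cluster the F-outliers, score each cluster point, and compare the count of positive scores against β) can be sketched as follows. The tiny k-means loop and its fixed iteration count are illustrative simplifications, and mn_nsc_fn (a function scoring a cluster center with MN-NSC) is assumed to be supplied by the caller; in the paper, k is set to n * |F| / |D_i|.

```python
import numpy as np

def detect_novel_class(F, k, beta, mn_nsc_fn, n_iter=20, seed=0):
    """Cluster the F-outliers F, score each cluster center with MN-NSC,
    and declare a novel class when more than beta clusters score positive."""
    F = np.asarray(F, float)
    rng = np.random.default_rng(seed)
    # tiny Lloyd's-style k-means, enough for a sketch
    centers = F[rng.choice(len(F), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = np.argmin(((F[:, None, :] - centers[None, :, :]) ** 2).sum(-1),
                           axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = F[labels == j].mean(axis=0)
    # count clusters whose center has a positive MN-NSC value
    count = sum(1 for c in centers if mn_nsc_fn(c) > 0)
    return count > beta
```

Requiring more than β positive clusters, rather than a single positive score, is what filters out isolated noise points that happen to look cohesive.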

Experiment 1.
According to the experimental objectives described above, we selected the Covertype, KDD Cup 1999, and ArtificialCDS datasets as experimental datasets and compared the classification accuracy and evaluation time of C&NCBM, MineClass, and KNN alone on these three datasets. The specific parameter values used for each dataset are shown in Table 3. The experimental results on the three datasets are shown in Tables 4-6.
The results in Tables 4-6 show that, over the whole data stream classification process, the classification accuracy of C&NCBM is very stable and significantly higher than that of the other two algorithms. MineClass also classifies better than KNN alone. The evaluation time of C&NCBM is significantly longer than that of the other two algorithms, while the evaluation times of MineClass and KNN alone differ little. In short, C&NCBM is more accurate than MineClass but requires more evaluation time.
The results of the three sets of experiments on two real datasets and one artificial dataset show that the proposed algorithm, applied to classifying data streams with concept drift and novel classes, has the following characteristics. (1) It makes timely judgments when a novel class appears in a concept drift data stream and adaptively updates the original model afterward, giving it stronger classification robustness to novel class occurrences. (2) Compared with an ordinary classifier, its classification accuracy improves significantly, and it also improves to a certain extent on MineClass [33], the classification and novel class detection algorithm based on Euclidean distance. (3) Its evaluation time is slightly longer than that of the other algorithms.

Experiment 2.
The appearance of concept drift in a data stream indicates that the mapping between attributes and categories has changed, and classifiers over the stream are built on this mapping. When the attribute-to-category mapping changes, the classifier's Kappa Statistic will inevitably change significantly. Therefore, in this section, we use the change in the classifiers' Kappa Statistic to determine the sensitivity of the different algorithms to concept drift.
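This Kappa-based sensitivity check can be expressed as a small helper that flags blocks where the statistic drops sharply relative to the previous block; the drop threshold here is our illustrative choice, not a value from the paper.

```python
def drift_signal(kappa_series, threshold=0.2):
    """Return the indices of blocks whose Kappa Statistic fell by more
    than `threshold` compared with the preceding block, a simple proxy
    for where concept drift hit the classifier."""
    return [i for i in range(1, len(kappa_series))
            if kappa_series[i - 1] - kappa_series[i] > threshold]
```

A drift-robust algorithm such as C&NCBM should produce a flat Kappa curve and therefore trigger few or no flags.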
We selected the Covertype and ArtificialCDS datasets as experimental datasets and compared the Kappa Statistic of C&NCBM, MineClass, and KNN on these two datasets. The comparison results are shown in Figure 1.
To introduce concept drift, we rearranged the Covertype dataset so that at most 3 and at least 2 categories appear simultaneously in any block, with new categories appearing at random. The concept drift in the rearranged Covertype dataset occurs mainly in blocks 3 and 5. The ArtificialCDS dataset automatically generated by MOA exhibits incremental drift, mainly in blocks 4 and 6.
The results in Figure 1 show that KNN's Kappa Statistic declines fastest because it lacks a concept drift handling mechanism. MineClass is partially affected, but its decrease is smaller than KNN's. C&NCBM is the least affected by concept drift, and its accuracy curve is the most gradual. When concept drift occurs in the data stream, all three algorithms are affected to some extent, but the proposed C&NCBM algorithm adapts to concept drift better and can reduce its influence on classification.

Conclusion
In this paper, MN-NSC, a cohesion-separation index based on the Mahalanobis distance, is proposed, and on top of this index a classification and novel class detection algorithm, C&NCBM, is built. Unlike the traditional approach of measuring the distance between instances with the Euclidean distance, this method pays more attention to the similarity between instances and can sensitively detect small changes among outliers. Comparative experiments with the KNN and MineClass algorithms verify the effectiveness of the classification algorithm, and the Kappa Statistic of C&NCBM, KNN, and MineClass is also compared.
The results show that the proposed C&NCBM algorithm performs best: its concept drift adaptability can reduce the influence of concept drift on classification in the data stream to some extent. However, because of the added Mahalanobis distance computation, the proposed algorithm requires slightly more time than the other algorithms. How to reduce the computation time while preserving the effectiveness of the classification is the direction of our future research.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.