Classification of the drifting data streams using heterogeneous diversified dynamic class-weighted ensemble

Data streams can be defined as the continuous stream of data coming from different sources and in different forms. Streams are often very dynamic, and its underlying structure usually changes over time, which may result to a phenomenon called concept drift. When solving predictive problems using the streaming data, traditional machine learning models trained on historical data may become invalid when such changes occur. Adaptive models equipped with mechanisms to reflect the changes in the data proved to be suitable to handle drifting streams. Adaptive ensemble models represent a popular group of these methods used in classification of drifting data streams. In this paper, we present the heterogeneous adaptive ensemble model for the data streams classification, which utilizes the dynamic class weighting scheme and a mechanism to maintain the diversity of the ensemble members. Our main objective was to design a model consisting of a heterogeneous group of base learners (Naive Bayes, k-NN, Decision trees), with adaptive mechanism which besides the performance of the members also takes into an account the diversity of the ensemble. The model was experimentally evaluated on both real-world and synthetic datasets. We compared the presented model with other existing adaptive ensemble methods, both from the perspective of predictive performance and computational resource requirements.


INTRODUCTION
Nowadays, the size of data is growing in a much faster fashion than in the past. Information is being collected from household appliances, tools, mobile devices, vehicles, sensors, websites, social networks, and many other devices. An increasingly large number of organizations are starting to analyze large volumes of data, as the information obtained from these data can provide a competitive advantage over other businesses. Data collection from devices is often continuous, and data come in the form of data streams.
Data stream classification is an active field of research, as more data sources can be considered as streaming data. When solving classification tasks using streaming data, the data generation process is not strictly stationary, and its underlying structure may change over time. The changes in the underlying data distribution within the streams may result in dynamic, non-stationary target concepts (Gama et al., 2014;Žliobaite, 2010).
This phenomenon is called concept drift, and from the perspective of the training of classification models on the drifting data, the most crucial requirement is the ability of the model to adapt and incorporate new data into the model in order to react to potential changes . In concept drift, the adaptive learning algorithms are advanced machine learning methods that can reflect the changing concepts in data streams in real time. Multiple approaches were proposed to extend the standard machine learning models with the ability to adapt to the changes in streams, including drift detectors (Gonçalves et al., 2014;Baena-García et al., 2006) and various sliding window techniques (Bifet & Gavaldà, 2007).
Ensemble models are a popular classification method, often providing better performance when compared to the standard machine learning models (Breiman, 1996(Breiman, , 2001Freund & Schapire, 1996). When processing dynamic data streams, where the concepts change over time, dynamic adaptive ensembles present a suitable method that retains long-present historical concepts and covers newly appearing ones. Ensemble methods proved to be capable of handling streaming data by updating its base learners (Kolter & Maloof, 2007;Bifet et al., 2015;Brzeziński & Stefanowski, 2011;Gomes et al., 2017) in either block-based, batch mode or instance-based, incremental mode. One of the crucial aspects of the ensemble classifiers for static data is to ensure the diversity of the base classifiers within the ensemble. In contrast to the diversity of the ensembles for the static data (Carney & Cunningham, 2000;Kuncheva & Whitaker, 2003), which is fairly well studied in the literature, there are fewer studies dealing with the ensemble's diversity in the presence of concept drift (Brzezinski & Stefanowski, 2016;Abassi, 2019).
The main objective of the work presented in this paper is to present the design and implementation of a novel adaptive ensemble classification algorithm. The primary motivation of this study is to design a model capable of handling various types of drifts. The proposed method can be characterized as a heterogeneous, chunk-based approach that utilizes different types of base classifiers within the ensemble. The model also uses Q statistic as a metric to measure the diversity between the ensemble members and dynamic weighting scheme. We assume that the creation of a heterogeneous ensemble consisting of different base learners with an adaptation mechanism that ensures the diversity of its members can lead to a robust ensemble that is able to handle the drifting data efficiently. To confirm these assumptions, we conducted an extensive experimental study, where we evaluated the proposed model performance on the selection of both real-world and synthetic, generated datasets with different types of concept drift. We also compared the proposed method to 12 adaptive ensemble methods, including state-of-the-art adaptive ensembles. The main objective was to compare the performance of the methods using standard evaluation metrics, as well as training times and resource consumption.
The paper is organized as follows. The background section provides basic definitions of the terms related to the data streams and the concept drift therein. The following section is dedicated to the state of the art in the area of the adaptive ensemble models used to handle the concept drift and defines the motivation for the presented approach. The following section describes the designed and implemented adaptive ensemble method.
We then describe the datasets used in the experimental evaluation. The experimental results section summarizes the conducted experiments and achieved results. Concluding remarks and possibilities for future work in this area are then summarized.

BACKGROUND
A data stream is defined as an ordered sequence of data items, which appear over time. Data streams are usually unbounded and ordered sequences of the data elements hx 1 ; x 2 ; . . . ; x j ; . . . ; i sequentially appearing from the stream source item by item (Gama, Aguilar-Ruiz & Klinkenberg, 2008). Each stream element is generated using a probability distribution P j . The time interval between the particular elements on the stream may vary. Particular data elements in the stream are usually of standard size, and most of the data streams are generated at high speed, e.g., the elements in the data streams appear rapidly.
Compared with static data, which is usually analyzed offline, data streams need to be analyzed in real time. The differences in the processing of the data streams compared to the processing of the static data can be summarized according to Gama (2010): The data elements in the stream arrive online or in batches. Elements appearing online are processed one by one, and batches usually have the same size and are processed at once.
The processing system has no control over the order in which data elements appear, either within a data stream or across multiple data streams. Data streams are potentially unbound in size. The size of the particular elements is usually small. In contrast, the entire data size may be huge, and it is generally impossible to store the whole stream in memory. Once an element from a data stream has been processed, it is usually discarded. It cannot be retrieved unless it is explicitly stored in memory.
The data streams can be divided into two groups: stationary data streams-the data distribution does not change over time, e.g., the stream elements are generated from a fixed probability distribution; non-stationary data streams-data are evolving, and the data distribution may change over time. Usually, these changes may also affect the target concepts (classes).

Concept drift
When solving predictive data analytical tasks on the static data, the data distribution usually does not change, and data used for the training and testing of the model have the same distribution. When processing the data streams, we often observe the changing nature of the data. In predictive data stream analytical tasks, we experience a phenomenon called concept drift.
Concept drift is related to the data distribution P t (x,y), where x ¼ ðx 1 ; x 2 . . . x n Þ is a data sample represented by an n-dimensional feature vector appearing at time t, and y represents the target class. The concepts in the data are stable (or stationary) if all the data samples are generated with the same distribution. If there is an x in the interval between t and t + Δ, which holds the expression P t (x,y) ≠ P t + Δ (x,y), then concept drift is present (there is a change in the underlying data distribution) (Žliobaite, 2010). Concept drift usually occurs in a non-stationary and dynamically changing environment, where the data distribution or relation between the input data and the target variable changes over time.

The authors in
Sudden/Abrupt-In this case, the concept change occurs suddenly. A concept (e.g., a target class) is suddenly replaced by another one. For example, in the topic modeling domain, the main topic of interest may unexpectedly switch to a different one. Incremental-Changes in the data distribution are slower and proceed over time.
Changes are not as visible as in a sudden drift, but they gradually emerge. The changes are usually relatively slow and can be observed when comparing the data over more extended time periods. Gradual-In this drift type, both concepts are present, but over time, one of them decreases, while the other one increases. For example, such a change may reflect the evolution of points of interest, e.g., when a point of interest is gradually being replaced by a newer one. Re-Occurring-A previously active concept reappears after some time. Re-occurrence may appear in cycles or not (e.g., reappearing fashion trends).
Besides the mentioned drift types, some publications (Gama et al., 2014) distinguish between two kinds of concept drift: real drift and virtual drift. Virtual concept drift is defined by the changes in data distribution but does not affect the target concept.
Real concept drift (also called concept shift) represents a change in the target concept, which may modify the decision boundaries. Figure 1 visualizes the drift types. In concept drift detection, it is also necessary to distinguish between the drift and outlier occurrence. Outliers may produce false alarms when detecting the concept drift.
When processing the non-stationary drifting streams, the necessary feature of the predictive algorithms is their ability to adapt. Some of the algorithms are naturally incremental (e.g., Naive Bayes), while others require significant changes in the algorithm structure to enable incremental processing. Therefore, the learning algorithms applied on the drifting streams are usually extended with a set of mechanisms, which enhance the models with the ability of continuously forgetting the obsolete learned concepts and updating the model with the newly arrived data in the stream. There are several types of models used to handle concept drift. To detect the concept drift in data streams, we can use drift detectors. These can detect possible concept drift by analyzing the incoming data or by monitoring the classifier performance. Detectors process the signal from the data about changes in data stream distribution. Drift detectors usually signalize drift occurrence and trigger the updating/replacement of the classifier. There are several drift detection methods available (Gonçalves et al., 2014), and the Drift Detection Method (DDM) (Gama et al., 2004), the Early Drift Detection Method (EDDM) (Baena-García et al., 2006), and ADWIN (Bifet & Gavaldà, 2007) are the most popular.
For predictive data modeling applied on the drifting streams, advanced adaptive supervised machine learning methods are used. Supervised learning methods used for drifting stream classification could be categorized from several perspectives, depending on how they approach the adaptation (Ditzler et al., 2015;Krawczyk et al., 2017): Active/Passive-Active methods usually utilize drift detection methods to detect the drift and to trigger the model update. Passive methods periodically update the model, without any knowledge of the drift occurrence. Chunk-Based/Online-Chunk-based methods process the streaming data in batches (each batch consists of a specified fixed number of stream elements). Online methods process the stream elements separately when they appear.
Ensemble models represent a popular solution for the classification of drifting data streams. An ensemble classification model is composed of a collection of classifiers (also called base learners, ensemble members, or experts) whose individual decisions are combined (most often by voting) to classify the new samples (Sagi & Rokach, 2018). The main idea of the ensemble model is based on the assumption that a set of classifiers together can achieve better performance than individual classifiers (Kuncheva & Whitaker, 2003). The selection of the ensemble experts is a crucial factor, as the ideal ensemble consists of a set of diverse base learners. Ensemble models are also suitable for data stream classification, where target concepts change over time. The following section summarizes the use of ensembles in the classification of drifting streams.

RELATED WORK
There are several approaches to the design of adaptive ensemble models. Some of them use the same technique as approaches to static data processing, such as Online Boosting (Wang & Pineau, 2013), which is based on the classic Boosting method, extended with online processing capabilities. For adaptation to concept drift, it uses concept drift detection. If a concept drift occurs, the entire model is discarded and replaced by a new model. Another well-known model is the OzaBagging (Oza, 2005) ensemble. Unlike Bagging for static data, OzaBagging does not use random sampling from the training data, but each of the samples is trained k times, which leads to a Poisson distribution.
Further studies have focused on the design of the ensemble models that would be simple (in terms of their run time) and able to adapt to the concept drift dynamically. For example, the Accuracy Weighted Ensemble (AWE) (Brzeziński & Stefanowski, 2011) uses the assignment of weight to the base classifiers based on a prediction error. Old and weak members are gradually being replaced by the new ones, with a lower error rate. The update mechanism is based on the assumption that the latest training chunk will better represent the current test chunk. Another model, Dynamic Weighted Majority (DWM) (Kolter & Maloof, 2007), dynamically changes the weights of the base classifiers in the case of incorrect classification. A new classifier is added if the model incorrectly classifies the training example, and old classifiers are discarded if their weights fall below a threshold value. Online Bagging and Boosting algorithms were recently used as a basis for more advanced streaming ensembles, such as Adaptive Ensemble Size (AES) (Olorunnimbe, Viktor & Paquet, 2018), which dynamically adapts the ensemble size, or an approach (Junior & Nicoletti, 2019), where boosting is applied to the new batches of data and maintains the ensemble by adding the base learners according to the ensemble accuracy rate. Learn++ (inspired by AdaBoost) is an incremental learning ensemble approach consisting of base learners trained on a subset of training data and able to learn the new classes (Polikar et al., 2001). Several modifications of this approach exist, focused on improvement of the number of generated ensemble members (Muhlbaier, Topalis & Polikar, 2004). The Random Forests method is probably the most popular ensemble method on static data at present. Its adaptive version for stream classification, Adaptive Random Forests (ARF), was introduced in Gomes et al. (2017) and has shown a high learning performance on streaming data. More recently, multiple adaptive versions of popular ensemble methods gaining improved performance or achieving speedup in execution have been introduced, e.g., the adaptive eXtreme Gradient Boosting method (Montiel et al., 2020), the streaming Active Deep Forest method (Luong, Nguyen & Liew, 2020), or Random Forests with an implemented resource-aware elastic swap mechanism (Marrón et al., 2019).
All ensemble models work with the assumption of the diversity of the individual classifiers in the ensemble, while the diversity is achieved in different ways. Diversity can help in evolving data streams, as the most suitable method may also change as a result of the stream evolution Pesaranghader, Viktor & Paquet (2018). Diverse ensembles by themselves cannot guarantee faster recovery from drifts, but can help to reduce the initial increase in error caused by a drift Minku, White & Yao (2010). There are several ways to achieve diversity in the ensemble. Either the classifiers are trained on different data samples, or the model is composed of a set of heterogeneous classifiers. Recently, Khamassi et al. (2019) studied the influence of diversity techniques (block-based, weighting-data, and filtering-data) on adaptive ensemble models and designed a new ensemble approach that combines the three diversity techniques. The authors in Sidhu & Bhatia (2018) experimented with a diversified, dynamic weighted majority voting approach consisting of two ensembles (with low and high diversity, achieved by replacing the Poisson (1) with Poisson (κ) distribution in online bagging (Oza & Russell, 2001)). The Kappa Updated Ensemble (KUE) Cano & Krawczyk (2020) trains its base learners using different subsets of features and updates them with new instances with a given probability following a Poisson distribution. Such an approach results in a higher ensemble diversity and outperforms most of the current adaptive ensembles. However, there are not many studies where the model uses the model diversity score as a criterion for the base classifiers in the ensemble (Krawczyk et al. (2017)) as opposed to static data processing, where such a complex model exists (Lysiak, Kurzynski & Woloszynski, 2014). According to Yang (2011), diversity correlates with model accuracy. A suitable diversity metric used in ensembles is a paired diversity Q statistic (Kuncheva, 2006), which provides information about differences between two base classifiers in the ensemble.
Another aspect of the ensemble classifiers is the composition of the base classifiers in the model. The most common are homogeneous ensemble methods, which use the same algorithm to train the ensemble members (Fernandez-Aleman et al., 2019). On the other hand, heterogeneous approaches are based on the utilization of multiple algorithms to generate ensemble members. Such an approach could lead to the creation of more diverse ensembles. For the data stream classification, a Heterogeneous Ensemble with Feature drift for Data Streams (HEFT-Stream) Nguyen et al. (2012) builds a heterogeneous ensemble composed of different online classifiers (e.g., Online Naive Bayes). Adaptive modifications of the heterogeneous ensembles were also successfully applied on the drifting data streams (Van Rijn et al., 2016;Frías-Blanco et al., 2016;Van Rijn et al., 2018;Idrees et al., 2020), and many of them proved suitable to address issues such as class imbalance (Large, Lines & Bagnall, 2017;Fernández et al., 2018;Ren et al., 2018;Wang, Minku & Yao, 2018;Ghaderi Zefrehi & Altınçay, 2020). The approach described in this paper aims to combine the construction of the adaptive heterogeneous ensemble with a diversity-based update of the ensemble members. This approach could result in a robust model, with the adaptation mechanism ensuring that newly added members are as diverse as possible during the ensemble updates. Maintaining the diversity of the overall model can also lead to a reduction of model updates and therefore faster execution during the run time.

DDCW ENSEMBLE METHOD
In the following section, we introduce the design of the Diversified Dynamic Class Weighted (DDCW) ensemble model. The design of the model is based on the assumption that a robust model consists of a collection of heterogeneous base classifiers that are very diverse. When applied to static data, the diversity is used within the ensemble models to tune the combination rule for voting and the aggregation of component classifier predictions. We propose the design of a heterogeneous ensemble model, which combines the dynamic weighting of the ensemble members with the mutual diversity score criterion. The diversity measures are used to rank the members within the ensemble and update their weights according to the diversity value, so the model prefers experts with higher mutual diversity, thereby creating a more robust ensemble. When ranking the base classifiers, the diversity measurement is combined with the lifetime of individual base classifiers in the model. The criterion is expected to cause the importance of the long-lasting base classifiers to gradually fade, which should ensure the relevance of the whole ensemble to evolving and changing data streams.
The model is composed of m ensemble members e 1 ,…, e m , trained using each chunk of incoming samples in the stream, as depicted in Fig. 2. Each of those experts e 1 ,,…, e m , have assigned weights for each target class. The weights are tuned after each period (after each chunk is processed) based on individual base classifier performance. First, for each class that a base classifier predicts correctly, the weight is increased. Second, after each chunk is processed, the model calculates Q pairwise diversity between each of the ensemble members and uses this value to modify model weights.
Pairwise Q diversity metric is calculated as follows (Kuncheva & Whitaker, 2003): let Z ¼ z 1 ; …; z N be a labeled data set, z j 2 R n coming from the classification problem. The output of a classifier D i is an N-dimensional binary vector y i ¼ ½y 1;i ; …; y N;i T , such that y j;i ¼ 1, if D i recognizes correctly z j , and 0 otherwise, i ¼ 1; …; L. Q statistic for two classifiers, D i and D k , is then computed as: Q i;k ¼ N 11 N 00 ÀN 01 N 10 N 11 N 00 þN 01 N 10 where N ab is the number of elements z j of Z for which y j;i ¼ a and y j;k ¼ b. Q varies between −1 and 1; classifiers that tend to recognize the same samples correctly will have positive values of Q, and those that commit errors on different objects will render Q negative. For statistically independent classifiers, the value of Q i,k is 0.
The value for each member div_e i in DDCW model is obtained as the average of contributions of individual pair diversities and is calculated as follows: div e i ¼ 1 mÀ1 P m k¼1;k6 ¼i Q i;k . Then, after each period, the lifetime coefficient T i of each ensemble member is increased. Afterwards, the weights of each of the base classifiers are modified using the lifetime coefficient. After this step, the weights are normalized for each class, and a score is calculated for each target class by the classifier predictions. In the last step, the global ensemble model prediction is selected as a target class with the highest score.
Model weights can be represented as a matrix W mÂc , where m is a number of classifiers in the ensemble, and c is a number of target classes in the data. The weights w i,j directly determine the weight given to classifier i for class j as seen in Table 1. In the beginning, the weights are initialized equally, based on the number of classes in the data. During the process, the individual weights for each base classifier and corresponding target class are modified using the parameter β. The weight matrix allows the calculation of the score of the base classifiers, as well as the score of predicted target classes. The score of the classifier is calculated as P c j¼1 w i;j for each classifier. This score allows the identification of poorly performing base classifiers in the ensemble. However, the score of the target class is calculated as the contribution of weight w i,j of classifier i and its predicted class j.  Table 1 Weight matrix.
The weight matrix enables the efficient contribution of each of the ensemble members in building the model. We proceeded from the assumption of having a diverse ensemble. Such a model could consist of members, which perform differently in specific target class values, e.g., some of the members are more precise when predicting a particular class than others and vice versa. In that case, we can use a set of classifiers; each of them focused on a different target class (while the distribution of these classes may change over time). The class weighting alone may lead to an ensemble consisting of similar, well-performing models. But the combination of class weighting with diversity measure can lead to a set of balanced members, more focused on specific classes and complementing each other.
The proposed ensemble model belongs to the category of passive chunk-based adaptive models. In the presented approach, the size of the model dynamically changes according to the global model performance. The minimum size of the model is set by the parameter k, and the maximum size is set by the parameter l.
Each base classifier in the ensemble is assigned with a weight vector for each target class. If a new target class appears in the chunk used in training, the new target will be added to the weight vector for each base classifier. Initially, the ensemble consists of a randomly initialized set from a list of defined base algorithms (Hoeffding Tree or Naive Bayes). Other experts can be added in the following periods (interval representing the chunk of arriving data, where base learners and their weights are modified) until the minimum size of the model is reached, i.e., either the model size is smaller than the defined minimum size or the particular member weight falls under the defined threshold.
In each iteration, the experts are used to predict the target class of incoming samples in the processed chunk (Lines 3-5). If a prediction of an expert is correct, the weights of the particular expert and target class are multiplied by a coefficient β (Line 7). In the case period p has occurred, Q statistic diversity is calculated for each pair of experts in the ensemble, and the weights of each expert is modified using the particular expert's diversity (Line 12 and 16). This mechanism enables the construction of more robust ensembles, by preferring the diverse base models. The weights of base classifiers are also reduced by the exponential value of their lifetime in the ensemble (Line 15). In this case, the lifetime of the expert represents the number of periods since its addition to the ensemble. The exponential function is used, so the experts are influenced minimally during their initial periods in the ensemble but become more significant for the long-lasting members. This implementation works as a gradual forgetting mechanism of the ensemble model, as the weakest experts are gradually removed from the model and replaced by the new ones.
After the update, the weights are normalized for each target class (Line 24). Afterwards, if the maximum size of the model is reached and the global prediction is incorrect, the weakest expert is removed from the ensemble (Line 27). A new random expert can then be added to the ensemble (Lines 30-31). In each period, all experts where the sum of weights is lower than defined threshold θ are removed from the ensemble. In the end, each sample is weighted by a random uniform value m times, where m represents the actual size of ensemble (Line 41). Each expert is than trained with a new set of incoming

DATASETS DESCRIPTION
To experimentally evaluate the performance of the presented approach, we decided to use both real-world and synthetically generated datasets. We tried to include multiple drifting datasets that contain different types of concept drift. Datasets used in the experiments are summarized in Table 2.

Real datasets
In our study, we used 12 real datasets, including the frequently used ELEC dataset (Harries, 1999), the KDD 99 challenge dataset (Tavallaee et al., 2009), Covtype (Blackard, 1998), the Airlines dataset introduced by Ikonomovska (http://kt.ijs.si/elena_ ikonomovska/data.html), and data streams from the OpenML platform Bischl et al. (2019) generated from a real-world dataset using a Bayesian Network Generator (BNG) Van Rijn et al. (2014). We included a wide range of datasets to evaluate the performance of the DDCW model on datasets with both binary and multi-class targets or with balanced and imbalanced classes, especially some of them, such as KDD99 and Shuttle are heavily imbalanced. To examine the imbalance degree of the datasets, we included the class ratios in the documentation on the GitHub repository (https://github.com/Miso-K/DDCW/ blob/master/classratios.txt). As it is difficult to determine the type of a real drift contained in such data, we tried to estimate and visualize possible drift occurrences. Multiple techniques for concept drift visualization exist (Pratt & Tschapek, 2003). We used visualization based on feature importance (using the Gini impurity) and the respective changes within the datasets, as they may signalize changes in concepts in the data. Based on the work described in Cassidy & Deviney (2015), we used feature importance scores derived from the Online Random Forest model trained on the datasets. Such an approach can help to visualize the so-called feature drift, which occurs when certain features stop (or start) being relevant to the learning task. Fig. 3 shows such visualizations for the real-world datasets used in the experiments. The visualization depicts how the importance of the features changes over time in the data. x-axis represents number of samples, y-axis feature indices and the size of the dots correspond to a feature importance in the given chunk.

Synthetic datasets
Besides the real-world datasets, we used synthetic data streams containing generators with various types of drifts. In most cases (except LED and STAGGER data), we used streams of 1,000,000 samples, with three simulated drifts. We used the Agrawal generator (Agrawal, Imieliński & Swami, 1993) and SEA (Nick Street & Kim, 2001) generators with abrupt and gradual drifts, RBF and Waveform streams without any drift and with simulated gradual drift, a Stagger Concepts Generator (Schlimmer & Granger, 1986) with abrupt drift, an LED (Gordon et al., 1984) stream with gradual drift, and a Mixed stream with an abrupt drift with balanced and imbalanced target attributes.

EXPERIMENTAL RESULTS
The purpose of the experiments was to determine the performance of the DDCW model in a series of different tests. We used the python implementation of all models, and the experiments were performed using the Scikit-multiflow framework (Montiel et al., 2018). All experiments were performed on a virtual server equipped with 6 CPU cores and 8 GB RAM. During the first series of experiments, we aimed to examine the impact of the different setting of the chunk-size parameter in which the model is updated and the diversity is calculated. The model was tested with varying sizes of the chunk, with values set to 100, 200, 300, 400, 500, and 1,000 samples on all considered datasets. The primary motivation of the experiments was to find the most suitable chunk size parameter for different datasets, which will be used to compare the DDCW with other ensemble models.
To evaluate the models, we used prequential evaluation (or interleaved test-then-train evaluation), in which the testing is performed on the new data before they are used to train the model. In the second set of the experiments, the main goal was to compare the performance of the DDCW model with the selected other streaming ensemble-based classifiers. We considered multiple ensemble models: DWM, AWE, Online Boosting, OzaBagging, and the currently best performing streaming ensembles such as ARF and KUE. To analyze the performance of these models, standard classification metrics were used (accuracy, precision, recall, and F1). Besides the comparison of the model performance, we measured the metrics related to resource consumption, such as total time required for training, scoring time of an instance, and memory requirements of the models. During these experiments, we set the chunk size to 1,000 samples (although we included DDCW with a chunk size of 100 samples for comparison) and set the model's hyper-parameters to create similar-sized ensemble models (min. 5 and max. 20 members in the ensemble).

Performance with the different chunk sizes
In this experiment, we explored the influence of the chunk window on the classifier's performance on different datasets. The main goal was to find the optimal chunk window size for a particular dataset. We set different chunk sizes and measured the model performance using selected metrics in defined periods (e.g., every 100 samples).
We computed the average model performance on the entire dataset using the above-mentioned classification metrics. A comparison of the DDCW classifier accuracy and F1 with different sizes of the chunks is shown in Table 3.
Besides setting the chunk size, we fine-tuned the model hyper-parameters. Our main objective was to estimate the suitable combinations of the parameters α, β, and θ. As the experiments were computationally intensive, most of the models were trained using the default hyper-parameter settings, with particular values set to α = 0.002, β = 3, and θ = 0.02. Regarding the model behavior with different hyper-parameter settings, α influences the lifetime of an expert in the ensemble and the speed of degradation of the expert score with increasing lifetime. Increasing values led usually to a more dynamic model, able to adapt rapidly, while lower values led to a more stable composition of the model. β influences the preference of the experts, which classified the samples correctly. Higher values of this parameter can suppress the poor-performing experts and raise the probability of updating them in the following iteration. θ serves as a threshold for the expert update. Lower values usually lead to more weak experts in the ensemble, a marginal contribution to the performance , but raise the model complexity significantly. Higher values force the updating of weak experts. However, some of them may be missing later on, after drift occurrence and the reappearance of previous concepts. The results proved, that the chunk size does have an impact on the model performance, and there are mostly minor differences in the performance metrics with different chunk size parameter setting. Although accuracy is not affected much, F1 metric improves significantly with larger chunk sizes, especially on the BNG_Zoo and BNG_Lymph datasets. In general, we can observe, that on the larger data (with more than 100,000 samples in the stream), larger windows resulted in slightly better performance. On the other hand, smaller chunk sizes enable the model to react more quickly to concept drift. In some cases, the accuracy metric proved to be not very useful, as the target class is strongly unbalanced or multi-class. It is evident mostly on the KDD99 or BNG_Lymph datasets, where high accuracy values are caused mainly by the classification into the majority class, while other minor classes do not influence this metric very much. A much better perspective on the actual model performance could be given by F1 measure.
The experiments summarized averaged results that the models achieved on the entire stream, but it is also important to explore how the performance progressed during the stream processing and observe their reactions to concept drift. Figure 4 visualizes the accuracy achieved by the DDCW models on the real datasets with both chunk sizes. The performance of the method with both settings is overall similar; however, we can see a difference in cases when a change (possible drift) occurs. On the KDD99 dataset, there is a significant decrease in the accuracy of the model after around 52,000 samples. Shorter chunk windows resulted in a much earlier reaction to the change, without any significant decrease in performance. On the Elec and Covtype datasets, the earlier reactions are also present and visible, resulting in higher performance metrics. Figure 5 depicts the DDCW model performance on the synthetic datasets with both chunk sizes. In the case of Stagger and LED datasets, the effects of different chunk sizes are comparable with the impact on the real datasets. Larger chunk sizes lead to later reactions to drift and a more significant decrease in accuracy. In contrast, performance evaluation on larger streams, such as AGR or Mixed streams, showed that the chunk size effect on stream processing is different. Contrary to previous datasets, larger chunk sizes resulted in a more robust model, still covering some of the previous concepts after  drift occurrence. After the drift, the model performance dropped significantly. When using larger chunk sizes, the model was able to use more data to update the ensemble, which led to improved performance. Figure 6 depicts the size of the ensemble during stream processing on selected datasets with the diversity of the ensemble members disabled and enabled. During the experiments, we set the minimum size of the DDCW ensemble to 5 and the maximum size to 20. Initially, we performed a set of tests to estimate the optimal ensemble size. Larger ensembles did not perform significantly better, while the computational complexity of the model was considerably higher. The experiments proved that, during the run time, the algorithm preferred a smaller pool of ensemble members. The algorithm added more ensemble members when a concept drift occurred. Enabling the diversity-based selection of experts resulted in a more stable composition of the ensemble and required fewer member updates.

Comparison with other ensemble models
In these experiments, we compared the DDCW model performance with the selected ensemble models. In the comparison, we included AWE, DWM, Online Boosting, OzaBagging, ARF, and KUE models. Each of the ensemble models was tested with different base learners. We evaluated Naive Bayes, Hoeffding Tree, and k-NN as base learners. Similar to the previous set of experiments, we used the accuracy, precision, recall, and F1 metrics for comparison purposes computed in the prequential fashion. When using the DDCW model, we included the DDCW model with a combination of Naive Bayes and Hoeffding trees as base learners, as well as a tree-based homogeneous ensemble alone.

Performance comparison
To summarize the experiments, Table 4 compares the performance of all evaluated models on the real datasets and Table 5 provides the similar comparison on the synthetic data streams. As in the previous experiments, the tables consists of overall averaged results that the models achieved on the entire stream. While most of the studies focus only on a comparison of accuracy, we decided to analyze other classification metrics as well.
Especially in the case of multi-class or heavily imbalanced data (e.g., KDD99), accuracy might not be the best choice to truly evaluate the performance of the model, therefore we choose also F1 metric as well. Please note, that we were unable to properly obtain F1 values from the KUE model on some of the datasets. The DDCW model proved to be suitable for data streams with different concept drifts and either binary or multi-class classification tasks. When considering the composition of the DDCW ensemble, the fact that the model relies on different base learners enables it to utilize the strengths of particular learners. Dynamic composition of the ensemble enables it to adapt to the particular stream by preferring a base learner that is more suitable for the given data. In general, the DDCW performs very well on the generated streams, gaining at least competitive results compared to the other models on the real-world datasets. The method appears to struggle more with some of the imbalanced datasets, as is apparent from the F1 results achieved on the KDD 99 or Airlines dataset. During this experiment, we also used two different DDCW method setups to compare the effect of the base learner selection. We used DDCW with only Hoeffding trees as a base learner and DDCW with a combination of Naive Bayes and Hoeffding trees. Although the homogeneous ensemble mostly performed slightly better, the heterogeneous one was usually faster to train and score and maintained a similar performance, which was a result of the inclusion of fast Naive Bayes ensemble members. In a similar fashion, we experimented with an integration of k-NN into the DDCW model, but as expected, k-NN base learners raised the resource requirements of the model and failed to provide a sufficient performance boost, so we did not include k-NN base learners in further experiments. Performance comparison showed that the DDCW method can produce results that are comparable to the related ensemble models on both, real and synthetic streams. In many cases, it is able to outperform existing algorithms (e.g., AWE, DWM, OZA, and OB) in both of the explored metrics. Current state-of-the-art methods such as KUE and ARF usually produce slightly better results, but the DDCW method showed a fairly competitive performance, surpassing on several datasets one of those methods in both, accuracy and F1 scores. However, the evaluation metrics represent only one aspect of the adaptive models' performance. During the experiments, we tried to evaluate another aspect of the studied models that may influence the run time of the models during deployment in real-world scenarios. We focused mostly on monitoring the model performance in terms of their demand on resources and resource consumption during the process. During the experiments, we collected data about the overall run-time aspects of the model. The following section compares the models from the perspective of training/scoring times and memory requirements.

Training time and memory usage
We analyzed the training and scoring times and the memory consumption during the training process to provide a different view of the models' performance, comparing performance metrics with resource consumption requirements. We measured the overall training and scoring times on the entire data by summing up all partial re-training and scoring times over the stream. Table 6 summarizes the results of all evaluated models. The table compares the total training time consumed in the training and re-training of the models, the total scoring time of all processed instances, and the average size of the model in memory during the entire stream processing. The results represent an averaged value of the total of five separate runs of each experiment. It is important to note that the KUE model was not included in this comparison. We used Python implementations of all evaluated models and the scikit-multiflow library during the experiments. The KUE model was available only in its Java implementation using the MOA library, therefore using different underlying technologies could influence the objective comparison of resource consumption. At the same time, it is essential to note that the Java implementation of the KUE model was significantly effective than all Python-based models, mostly in training times, which were remarkably shorter. The choice of a base classifier heavily influences the overall run-time requirements of all models. Most apparently, the choice of k-NNADWIN as a base learner for OnlineBagging and OzaBagging methods leads to a massive increase of memory consumption and data processing times. Nearest neighbor classifiers require to store the training data in memory, which could lead to increased memory consumption and therefore increased training times. On the other hand, ensembles which consist of Naive Bayes and Hoeffding Tree classifiers as the base learners are much faster to train and require significantly lower memory during the run-time. However, Online Boosting and OzaBagging methods are much more sensitive to the number of features of the processed data. It can be observed on KDD99 and Covtype datasets, where these relatively faster models, required significantly longer times to either train or score the instances. DDCW ensemble training time and memory consumption requirements reflect the fact that the model consists of a mixture of Hoeffding Tree and Naive Bayes classifiers. When experimenting with the homogeneous DDCW ensemble, the performance results were better on many of the datasets. On the other hand, heterogeneous DDCW model provided a small decrease of the performance, but in most of the cases, inclusion of a Naive Bayes base learner led to a shorter training times and reduced memory usage (most significantly on e.g., Waveform data, where the training time was reduced to a half of the total training time of the homogeneous DDCW ensemble). When taking into consideration both aspects, DDCW model can, in some cases present a compromise between performance and resource requirements. Although Online Boosting or OzaBagging model performs with higher degrees of accuracy on some of the datasets, their computational intensiveness and more extended training and scoring times may be a factor to consider a simpler model. Similarly, ARF and KUE models provide superior performance on the majority of the datasets. When compared to those state of the art methods, DDCW method produced mostly comparable results, but needed less training time with lesser memory requirements (especially on the larger synthetic data streams) than ARF method. DDCW ensemble, in this case, may offer a reasonable alternative by providing a well-performing model, while maintaining reasonable requirements on run-time.

CONCLUSIONS
In the presented paper, we propose a heterogeneous adaptive ensemble classifier with a dynamic weighting scheme based on the diversity of its base classifiers. The algorithm was evaluated on a wide range of datasets, including real and synthetic ones, with different types of concept drift. During the experiments, we compared the proposed method with several other adaptive ensemble methods. The results proved that the proposed model is able to adapt to drift occurrence relatively fast and is able to achieve at least comparable performance to the existing approaches, on both real and synthetically generated datasets. While still performing well, the model also manages to maintain reasonable resource requirements in terms of memory consumption and time needed to score the unknown samples. The proposed approach is also dependent on chunk size parameter setting, as the performance of the model on certain datasets change significantly with different chunk sizes. Further research with adaptive heterogeneous ensemble models may lead to an exploration of modifications to weighting schemes that improve performance in multi-class classification problems or classifications of heavy imbalanced data. Another interesting field for future work is the integration of adaptation mechanisms with semantic models of the application domain. A domain knowledge model could provide a description of the data, the essential domain concepts, and their relationships. Such a model could also be used to improve classification performance by capturing expert domain knowledge and utilizing it in the process of classification of unknown samples. A knowledge model could be used to extract new expert features not previously contained in the data or to extract interesting trends present in the data stream. Such extensions could represent expert knowledge and could thus be leveraged to detect frequent patterns leading to concept drift while reducing the time normally needed to adapt the models with that knowledge.