1 Introduction

Classification is one of the main data mining tasks used in personalization, anomaly detection, recommendation and prediction. There are several commercial applications for data stream classification, such as sensor networks, Internet traffic management, web log analysis, intrusion detection, and credit and fraud detection. Many classification techniques have been developed for data streams (Gaber et al. 2007). In this paper, we focus on discriminative associative classification in data streams.

Associative classification (AC) (Ma 1998; Li et al. 2001) is based on the integration of association rule mining and classification. The class association rules (CARs) predict class labels. Associative classification can compete with decision tree, rule induction and probabilistic classifiers (Ma 1998; Thabtah 2007; Abdelhamid and Thabtah 2014). These methods first mine the frequent itemsets that pass the support threshold, and then, out of them, derive the association classification rules that meet the confidence criterion. We utilize the discriminative itemsets in the tilted-time window model and propose an accurate and efficient classification technique, called discriminative associative classification (H-DAC), for data streams.

Discriminative itemsets in the tilted-time window model are itemsets that are frequent in one data stream and whose frequency in that stream is much higher than in the rest of the streams, in different periods (Lin et al. 2010; Seyfi et al. 2017, 2021a, b). We discover class discriminative association rules (CDARs) in the tilted-time window model, out of discriminative itemsets, based on discriminative value, minimum confidence and minimum support thresholds, in different periods. These are the class association rules (CARs) in one data stream that have higher support compared with the same rules in other data streams. They distinguish each data stream from all other data streams in each period. The interpretability of the CDARs is expected to be much higher than that of the CARs, as they show the dominant rules in each data stream which have less or no importance in other data streams. Compared to CARs, they exclude the rules which are dominant in more than one data stream. Fast algorithms have been proposed for mining discriminative items (Lin et al. 2010; Seyfi 2011), discriminative itemsets in static datasets (Seyfi et al. 2014, 2017, 2021a, b), and discriminative itemsets in data streams (Seyfi 2018; Seyfi et al. 2021a, b).

Although the field of classification is crowded, we target an efficient and accurate classification method specifically for large multiple data streams. We propose an approach based on rule mining followed by rule selection, targeting the classification of large data streams efficiently while also improving accuracy. Compared to traditional classification problems, we work on much larger datasets (i.e., as in the following scenarios), and our H-DAC algorithm shows much better time and space efficiency using its pruning heuristics (Seyfi et al. 2017). The motivation for this research is to simplify data stream classification based on mining CDARs as a type of contrast pattern. The necessity of discovering CDARs from multiple large, fast data streams comes from mining sparse sets of distinguishing rules.

There are several examples where the significance of CDARs in data streams can be demonstrated. In network traffic measurement, one looks for the concurrent activities of one user that are more frequent in comparison to the rest of the group activities in the whole network. Moreover, CDARs can be effectively used in web page document classification as well as in the personalization of search engines and news delivery services. The frequent patterns are generally frequent in all data streams and may not be distinctive (Lin et al. 2010). Analyzing clickstream data to identify the web pages that are visited by a specific user (or a group of users) more frequently than by other users (or groups) can lead to improved personalized services. In dynamic tracing of stock market fluctuations, itemsets that occur more frequently in one stock as compared with other ones are of interest. They are useful for fraud detection if a group of customers is buying certain items more frequently than the rest of the population. An essential issue inherent in all the mentioned applications is to find the rules that can distinguish each stream from all other streams.

The most important challenge concerning data stream classification is concept drift. Moreover, there are challenges raised by the combinatorial explosion of itemsets, demanding high time and space consumption. Considering concept drifts in evolving data streams, the classification rules are usually time sensitive. Rules that appeared in the past may no longer be dominant and may have lost their relevance (e.g., rules in news delivery services). Particular groups of rules appearing in one period should not affect the general trend in data streams over the history or the recent time (e.g., rules related to specific events). Time-related rules and the changes in their trends during the history of data streams are of interest, with recent rules in short time intervals and old rules in larger time intervals. These rules are represented in the tilted-time windows for data stream classification. The tilted-time window model is made of multiple windows of different sizes, each pointing to a specific period.

CDARs have a definition close to that of emerging patterns (EPs) (Dong et al. 1999). However, in EP mining the degree of change in the support of itemsets is important, and the actual support of itemsets is not considered (Dong and Li 1999). Moreover, EPs are generally defined for static datasets. The discriminative itemsets proposed in this paper are discovered with their explicit relative supports and discriminative values (i.e., discrimination between support in each data stream vs support in all other data streams). The discriminative itemsets are small subsets of frequent itemsets, and the proposed H-DAC method is based on the fundamentals of the FP-Growth (Han et al. 2000) and FP-Stream (Giannella et al. 2003) methods. Among recently proposed classification techniques, we refer to the Error-driven discriminative learning algorithm (Hoppe et al. 2022), the Novel framework for cancer classification (Aziz 2022a, b), the Discriminative network for time series classification (Wang et al. 2022), the Metaheuristics model for gene classification (Aziz 2022a, b), and Association rules in Graph Network (Zhang et al. 2022). We will discuss these briefly in the next section.

To the best of our knowledge, ours is the first work on discriminative associative classification mining in data streams using the tilted-time window model. Compared with associative classification mining, we utilize the efficient H-DISSparse method (Seyfi et al. 2021a, b) for mining discriminative itemsets in data streams and extract the rules out of the discriminative itemsets. This does not follow the Apriori property which is mainly used in frequent itemset mining, whether based on Apriori or FP-Growth (Thabtah 2007; Abdelhamid and Thabtah 2014). The H-DISSparse algorithm works based on two data streams, i.e., the target data stream and the general data stream. We propose an advanced, highly efficient and accurate method, called H-DAC, that works based on multiple data streams. In fact, instead of mining rules in the target data stream vs the general data stream (i.e., one vs all), we discover the rules in each data stream vs all other data streams (i.e., each vs all). We propose an in-memory prefix-tree structure, called H-DACStream, as in FP-Growth (Han et al. 2000). This is used for holding the rules in the tilted-time window model. Finding a small set of rules from an exponential number of itemsets is time-consuming. Rule pruning then happens by deleting the misleading rules. After that, the discovered rules are ranked based on their confidence, discriminative value and support. Then, the true sets of rules are selected for classifying the data streams. Finally, the rules are evaluated. Despite the challenges, discriminative associative classification is an emerging research area with great potential. The proposed method is tested on various data streams showing different characteristics. Empirical analysis shows its high accuracy with efficient time and space usage.

To summarize our contribution: first, we define a new set of contrast patterns called CDARs for data streams; then we develop concise data structures for holding the discovered patterns (rules) over the history of data streams; next, we propose a novel algorithm for data stream classification; and finally, we evaluate our algorithm using large data streams. More specifically, the following contributions are made:

  • Defining the problem of mining “discriminative associative classification” in multiple data streams;

  • Introducing the novel in-memory H-DACStream structure holding class discriminative association rules (CDARs);

  • Developing the single-pass H-DAC algorithm for mining the CDARs accurately and efficiently in data streams based on the tilted-time window model; and

  • Evaluating the proposed algorithm in a range of real data streams with different parameter settings.

The rest of the paper is organized as follows. In the next section, the related works are presented and the research problem is defined in Sect. 3. In Sect. 4, the proposed method is presented in detail, followed by experimental results reported in Sect. 5. The conclusion and future works are presented in Sect. 6.

2 Related works

Recently, several classification methods have been proposed. The Error-driven learning algorithm (Hoppe et al. 2022) iteratively highlights the discriminative nature of learning by adjusting the expectations based on prediction error. The Novel framework for cancer classification (Aziz 2022a, b) reduces the classifier’s prediction error and speeds up the convergence using informative genes. The Discriminative network for time series classification (Wang et al. 2022) considers the relative importance of the temporal data at all classes for training errors to improve the classifier performance. The Metaheuristics model for gene classification (Aziz 2022a, b) minimizes the number of selected genes and at the same time maximizes the classification accuracy. The Association rules in Graph Network (Zhang et al. 2022) use association-rule-based and graph-based attention weights together to cover a wide variety of classification applications. Moreover, several approaches are defined in the data profiling research area to address the problems of discovering rules and meta-data from dynamic scenarios that evolve, such as data streams (Caruccio et al. 2021a, b). This requires defining search strategies and validation techniques for analyzing only the updated portion of the dataset affected by new changes. In addition, a method was proposed based on functional dependencies in dynamic datasets (Schirmer et al. 2019). This method inspects dataset changes and evolves its functional dependencies rather than recalculating them. Some platforms allow configuring and monitoring the meta-data discovery from data streams. Although the definition of these types of efficient algorithms is a very challenging task, their configuration and monitoring are often complex, requiring the definition of new visual tools capable of performing these operations in real time (Villanueva et al. 2014; Doan et al. 2015; Breve et al. 2021; Caruccio et al. 2021a, b). In this paper, we propose a novel data stream classification method based on discriminative itemsets (i.e., CDARs).

CDARs have a definition close to that of emerging patterns (Dong and Li 1999). EPs are defined as itemsets whose frequencies grow significantly higher in one dataset in comparison to another one. They are identified by extracting the maximal itemsets separately for each dataset using a defined minimum threshold. A group of maximal itemsets is then reported between the two borders (Dong and Li 1999). In EP mining, the degree of change in the support of itemsets is important, and the actual support of itemsets is not considered (Dong and Li 1999). For CDARs, the exact support of rules is known, which allows a comparison of their support counts with the minimum confidence, minimum support and discriminative level thresholds. Moreover, EPs are generally defined for static datasets. Alhammady and Ramamohanarao (2005) attempted EP mining in data streams based on the same idea of border definition. This method reports the EPs related to each block of transactions and then discards the block from the process. In Bailey and Loekito (2010), a method is proposed for mining contrast patterns in changing data based on the old and the current parts of a data stream. The method is focused on jumping emerging patterns (JEPs) as special types of contrast patterns. The minimal JEPs are discovered in the data stream by adding new transactions and deleting the old transactions. This differs from the problem addressed in this paper, as those contrast patterns are discovered in the old part (i.e., old class) compared to the recent part (i.e., recent class) of a single data stream. The discriminative rules proposed in this paper are discovered in multiple data streams changing at the same time. The emerging pattern mining with streaming feature selection (Yu et al. 2015) dynamically selects and maintains the effective features from the feature stream.

Associative classification (AC) is conceptually close to our proposed method. The classifiers are designed based on the association rules discovered from frequent itemsets. These methods mainly have four steps: rule ranking, rule pruning, rule prediction and rule evaluation (Ma 1998; Li et al. 2001; Yin and Han 2003; Thabtah et al. 2004). The most challenging step is rule mining. In classification datasets, usually, we have a large number of association rules and rule discovery is very time-consuming. This does not scale to large, fast data streams. We use discriminative itemsets as a small subset of frequent itemsets. They show the distinguishing characteristics of the data streams compared with each other. The effectiveness and efficiency of discriminative itemset mining have a high impact on the quality of the developed classifier. However, the rule mining challenges also have to be addressed.

We utilize the H-DISSparse method (Seyfi et al. 2021a, b) for mining CDARs in data streams using the tilted-time window model. The method does not require the prior frequency calculation of all generated itemset combinations. It is a heuristic-based method, and it effectively utilizes a prefix-tree structure (i.e., H-DACStream) for holding the rules in the tilted-time window model. The empirical analysis of the proposed method reveals that it can produce a set of CDARs with good approximation. The rule ranking is based on rule precedence. The most distinguishing characteristics of the rules are applied (i.e., confidence, discriminative value and support), respectively. There are several rule pruning methods in the literature (Thabtah 2007; Abdelhamid and Thabtah 2014). We prune redundant and misleading rules mainly based on the defined confidence and support thresholds and rule lengths. It has to be considered that our method prunes many rules during rule mining (i.e., because of discriminative values). The classifier at the end is evaluated based on the general measures defined for classification.

Compared with the current associative classification techniques, our proposed discriminative associative classification method is different. First, our method is based on discriminative itemsets in multiple data streams, whereas the AC methods are designed based on frequent itemsets only, working on single datasets. Second, the Apriori property defined for frequent itemset mining is not valid, and a subset of a CDAR can be non-discriminative. Third, the rule ranking and rule pruning processes are quite different, as the number of discovered rules is much smaller. Fourth, there is no significant AC method for data stream classification.

The proposed H-DAC method is based on the fundamentals of the FP-Growth method (Han et al. 2000). Our proposed method effectively utilizes a prefix-tree structure for holding CDARs in the tilted-time window model. The logarithmic tilted-time window model (Giannella et al. 2003) is used to maintain the recent rules at fine granularities and the historical ones at coarse granularities with approximate supports, as in Seyfi et al. (2021a, b). The proposed method produces an exact set of rules in the current batch of transactions and, in the historical tilted-time window model, a set of rules with high precision and recall.

In summary, our proposed method is positioned in the category of associative classification methods. CDARs are rules with discriminative itemsets on their left side and class labels on their right side. Each rule is considered with its approximate discriminative value, support and confidence. The rules are few in number, as many associative classification rules (i.e., frequent ones) are not discriminative in any class and are pruned during rule mining, as will be discussed. The advantages of this method are its rule interpretability in the application domain and its high accuracy and efficiency when applied to large multiple data streams.

3 Problem statement

Let the streams consist of a continuous flow of transactions \(T\) of different lengths, each transaction made of items from the lexicographically ordered alphabet \(\sum\) (i.e., we cannot assume that the stream is sorted in any way, except by the time of arrival of the data; this ordering is only for simplicity of presentation). The data streams \(\bigcup\nolimits_{i = 1}^{m} {S_{i} }\) are defined, each consisting of a different number (i.e., cardinality) of transactions \(n_{i}\), \(i = 1,2, \ldots ,m\) (i.e., \(n_{i}\) is the size of data stream \(S_{i}\)). A group of input transactions from the multiple data streams \(S_{i}\) (i.e., \(i = 1,2, \ldots ,m\)) in a pre-defined period is set as a batch of transactions \(B_{n}\), \(n \ge 1\). It is necessary to consider that more transactions can arrive at the same time in multiple data streams.

The tilted-time window model is composed of different window frames denoted as \(W_{k}\), \(k \ge 0\), as in Fig. 1. Each window frame \(W_{k}\) refers to a different period containing rules made of transactions from different numbers of batches in the multiple data streams \(\bigcup\nolimits_{i = 1}^{m} {S_{i} }\). Each window frame \(W_{k}\) has length (i.e., cardinality) \(n_{i}^{k}\) with respect to data stream \(S_{i}\). The current window frame is denoted as \(W_{0}\).

Fig. 1 Tilted-time window frames

An itemset \(I\) is defined as a subset of the items \(\sum\) appearing in transactions of the data streams. The itemset frequency is the number of transactions that contain the itemset. The frequency of itemset \(I\) in data stream \(S_{i}\) in the window frame \(W_{k}\) is denoted as \(C_{i}^{k} \left( I \right)\), and the frequency ratio of itemset \(I\) in data stream \(S_{i}\) in the window frame \(W_{k}\) is defined as \(r_{i}^{k} \left( I \right) = C_{i}^{k} \left( I \right)/n_{i}^{k}\) (i.e., \(i = 1,2, \ldots ,m\) and \(k \ge 0\)). The discriminative itemsets are those itemsets that occur in one data stream more frequently than in all other data streams. In other words, we look for itemsets that are frequent in the data stream \(S_{i}\) and whose frequency in that data stream is higher than that of the same itemsets in all other data streams based on the specified threshold \(\theta > 1\), called the discriminative level. An itemset \(I\) is considered discriminative in the window frame \(W_{k}\) if \(R_{{i\bigcup\nolimits_{j \ne i}^{m} j }}^{k} \left( I \right) \ge \theta\), using the following definition; \(\exists i \in \left\{ {1,2, \ldots ,m} \right\}\) such that:

$$ R_{{i\bigcup\nolimits_{j \ne i}^{m} j }}^{k} \left( I \right) = \frac{{r_{i}^{k} \left( I \right)}}{{\mathop \sum \nolimits_{j \ne i}^{m} r_{j}^{k} \left( I \right)}} = \frac{{C_{i}^{k} \left( I \right)\mathop \sum \nolimits_{j \ne i}^{m} n_{j}^{k} }}{{\mathop \sum \nolimits_{j \ne i}^{m} { }C_{j}^{k} \left( I \right)n_{i}^{k} }} \ge \theta $$
(1)

The reason for \(j \ne i\) in the sum of the denominator is that the ratio of itemset frequency is defined as its ratio in each data class vs its ratio in all other data classes, and \(r_{i} \left( I \right)\) should be divided by the sum of all \(r_{j} \left( I \right)\) excluding itself. This makes the denominator change for every \(i\), and thus it makes the comparison of the measures more relative. To deal with the situation when \(\sum\nolimits_{j \ne i}^{m} {C_{j}^{k} \left( I \right)} = 0\), a user-specified minimum support threshold \(0 < \varphi < 1/\theta\) is used. An itemset \(I\) becomes discriminative if its frequency in the window frame \(W_{k}\) is at least \(\varphi \theta n_{i}^{k}\) and \(R_{{i\bigcup\nolimits_{j \ne i}^{m} j }}^{k} \left( I \right) \ge \theta\). The set of discriminative itemsets (\(DI\)) in data streams in different window frames is defined as; \(\exists i \in \left\{ {1,2, \ldots ,m} \right\}\) such that:

$$ DI_{{i\bigcup\nolimits_{j \ne i}^{m} j }}^{k} = \left\{ {I \subseteq \sum |{ }C_{i}^{k} \left( I \right) \ge \varphi \theta n_{i}^{k}\; \& \;R_{{i\bigcup\nolimits_{j \ne i}^{m} j }}^{k} \left( I \right) \ge \theta } \right\} $$
(2)
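To make Eqs. (1) and (2) concrete, the following C++ sketch (naming is ours, not the authors' implementation) checks both conditions for one target stream, including the minimum-support fallback used when the denominator is zero:

```cpp
#include <cstdio>
#include <vector>

// Sketch of Eqs. (1)-(2): counts[i] = C_i^k(I), sizes[i] = n_i^k.
// Returns true if itemset I is discriminative in stream i for window W_k.
bool isDiscriminative(const std::vector<long>& counts,
                      const std::vector<long>& sizes,
                      int i, double theta, double phi) {
    long cOther = 0, nOther = 0;
    for (int j = 0; j < (int)counts.size(); ++j)
        if (j != i) { cOther += counts[j]; nOther += sizes[j]; }
    // Frequency condition of Eq. (2): C_i^k(I) >= phi * theta * n_i^k.
    if (counts[i] < phi * theta * sizes[i]) return false;
    // Zero denominator: the frequency condition alone decides.
    if (cOther == 0) return true;
    // Ratio condition of Eq. (1):
    // R = (C_i * sum_{j!=i} n_j) / (sum_{j!=i} C_j * n_i) >= theta.
    return (double)counts[i] * nOther / ((double)cOther * sizes[i]) >= theta;
}

int main() {
    // Two streams with sizes {10, 40} and itemset counts {5, 10}:
    // R = (5 * 40) / (10 * 10) = 2, so I is discriminative for theta = 2.
    std::vector<long> counts{5, 10}, sizes{10, 40};
    std::printf("%d\n", isDiscriminative(counts, sizes, 0, 2.0, 0.1));
    return 0;
}
```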

The itemsets that are not discriminative in the current window frame \(W_{0}\) can be discriminative in some larger window frames in the tilted-time window model (e.g., by merging the multiple window frames). To avoid missing potential discriminative itemsets in larger window frames, we propose to identify sub-discriminative itemsets in the tilted-time window model with a relaxation of \( \alpha \in \left( {0,1} \right)\).

The logarithmic tilted-time window model is a compact data structure for holding the discriminative itemsets in multiple data streams in different time frames; for example, a batch of transactions in one minute is supposed to be the smallest period. The current window frame shows the discriminative itemsets in data streams in the last minute and it is followed by the results in the remaining slots of the next 2 min, 4 min, 8 min, etc.
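The shifting-and-merging mechanics can be sketched as a binary-counter style structure (a simplification of the FP-Stream windows of Giannella et al. (2003), which also keep intermediate frames; naming is ours): level \(k\) covers \(2^k\) base periods, and when a level is already occupied the two counts are merged and carried to the next level.

```cpp
#include <cstddef>
#include <vector>

// Simplified logarithmic tilted-time window for one rule's frequency
// (binary-counter sketch; the real FP-Stream windows also keep
// intermediate frames and batch boundaries).
struct TiltedTimeWindow {
    std::vector<long> frame;    // frame[k] covers 2^k base periods
    std::vector<bool> occupied; // whether level k currently holds a count

    // Shift in the count of the newest batch (window W_0).
    void shift(long newCount) {
        long carry = newCount;
        for (std::size_t k = 0;; ++k) {
            if (k == frame.size()) { frame.push_back(0); occupied.push_back(false); }
            if (!occupied[k]) { frame[k] = carry; occupied[k] = true; return; }
            carry += frame[k];  // merge two 2^k frames into one 2^(k+1) frame
            frame[k] = 0; occupied[k] = false;
        }
    }
};
```

With one-minute batches, level 0 covers the last minute, level 1 the previous two minutes, level 2 the four minutes before that, and so on, matching the granularities described above.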

Let \(DI\) be the set of discovered discriminative itemsets and \(Y = \left\{ {1,2, \ldots ,m} \right\}\) be the set of class labels in the dataset. The class discriminative association rules (CDARs) are defined as the rules \(I \to y\) in which \(I \in DI\) and \(y \in Y\). A rule \(I \to y\) holds in data streams in window frame \(W_{k}\) with confidence \(c\) if \(c\%\) of the cases in \(D\) that contain \(I\) are labeled with class \(y\) (i.e., \(\frac{{C_{i} \left( I \right)}}{{\mathop \sum \nolimits_{j = 1}^{m} C_{j} \left( I \right)}}*100\% \ge c\)). The rule \(I \to y\) has support \(s\) (i.e., \(s = \varphi *100\%\)) in data streams in window frame \(W_{k}\) if \(s\%\) of the cases in \(D\) contain \(I\) and are labeled with class \(y\) (i.e., \(\frac{{C_{i} \left( I \right)}}{{\mathop \sum \nolimits_{j = 1}^{m} n_{j} }}*100\% \ge s\)). We define the rules in multiple data streams (i.e., \(\left\{ {1,2, \ldots ,m} \right\}\)).

The rule mining algorithm skips the non-potential discriminative itemsets using two heuristics (i.e., in a batch of transactions belonging to multiple data streams). In Seyfi et al. (2017), it is proved that the heuristic-based DISSparse algorithm is correct (i.e., it does not miss any discriminative itemset) and complete (i.e., it generates the potential superset of all discriminative itemsets), and that it works efficiently. These two heuristics will be discussed in Sect. 4.2. We then generate the complete set of rules that satisfy the user-specified discriminative value (called ratio \(\theta\)), minimum confidence (called minconf \(c\)) and minimum support (called minsup \(s\)) constraints, and build a classifier out of them.

Example 1

We assume two attributes (i.e., \(A\) and \(B\)) within two data streams with the current lengths \(n_{1}^{0} = 10\) and \(n_{2}^{0} = 40\), respectively (i.e., the total number of cases in the data streams is \(50\)). Consider the itemset \(\left\{ {\left( {A,1} \right),\left( {B,1} \right)} \right\}\) in the data streams with \(sup = 15\). There are two rules: \(\left\langle {\left\{ {\left( {A,1} \right),\left( {B,1} \right)} \right\},\left( {stream,1} \right)} \right\rangle\) with support \(5\) (i.e., \(sup = \frac{5}{50}*100\% = 10\%\)) and \(\left\langle {\left\{ {\left( {A,1} \right),\left( {B,1} \right)} \right\},\left( {stream,2} \right)} \right\rangle\) with support \(10\) (i.e., \(sup = \frac{10}{{50}}*100\% = 20\%\)). The first rule is discriminative in \(stream 1\) vs the other streams (i.e., \(stream 2\)) with a discriminative level equal to \(2\) (i.e., \(R_{{1\bigcup\nolimits_{j \ne 1}^{2} j }}^{0} = \frac{{C_{1}^{0} \left( I \right)\mathop \sum \nolimits_{j \ne 1}^{2} n_{j}^{0} }}{{\mathop \sum \nolimits_{j \ne 1}^{2} C_{j}^{0} \left( I \right)n_{1}^{0} }} = \frac{5*40}{{10*10}} = 2\)). The confidence of the first rule is \(33.3\%\) and that of the second rule is \(66.7\%\). By setting \(s\) equal to \(10\%\), both rules are frequent. By setting \(\theta = 2\), the first rule is discriminative in \(stream 1\) versus \(stream 2\). It has to be noted that just having higher support in a data stream is not enough; the support ratio of the rule between data streams (i.e., the discriminative value) should be higher than \(\theta\). For any discriminative rule, the confidence of the rules in different data streams can be smaller, bigger or equal to each other.

Considering all the rules that have the same itemset on their left side (i.e., \(I\) in \(I \to y\)), we choose the one with the highest confidence. In the case of more than one rule with similar \(I\) and the same confidence, we choose the one with the highest discrimination. For more than one rule with similar \(I\) and the same confidence and discrimination, we select the one with the highest support. Finally, in case of similarity in all criteria, we select the rule with the shortest length. In Example 1, we observed the rule \(\left\langle {\left\{ {\left( {A,1} \right),\left( {B,1} \right)} \right\},\left( {stream,1} \right)} \right\rangle\) with \(sup = 10\%\), \(confd = 33.3\%\) and \(Dis = 2\). The other one, \(\left\langle {\left\{ {\left( {A,1} \right),\left( {B,1} \right)} \right\},\left( {stream,2} \right)} \right\rangle\), has \(sup = 20\%\), \(confd = 66.7\%\) and \(Dis = 0.5\), and is not discriminative (i.e., it is not a rule). The rules with support greater than \(s\) are frequent. The rules with discrimination greater than \(\theta\) are discriminative. The rules with confidence greater than \(c\) are accurate. We aim to discover the set of rules that are all frequent, discriminative and accurate.
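The numbers in Example 1 can be checked mechanically; this short, illustrative C++ program recomputes the support, confidence and discriminative level of both candidate rules:

```cpp
#include <cstdio>

int main() {
    // Example 1: n_1 = 10, n_2 = 40, C_1(I) = 5, C_2(I) = 10.
    double n1 = 10, n2 = 40, c1 = 5, c2 = 10, total = n1 + n2;
    std::printf("sup:  %.1f%%  %.1f%%\n", c1 / total * 100, c2 / total * 100);
    std::printf("conf: %.1f%%  %.1f%%\n",
                c1 / (c1 + c2) * 100, c2 / (c1 + c2) * 100);
    std::printf("Dis:  %.1f   %.1f\n",
                (c1 * n2) / (c2 * n1), (c2 * n1) / (c1 * n2));
    // Prints sup 10.0%/20.0%, conf 33.3%/66.7%, Dis 2.0/0.5,
    // matching the values derived in the text.
    return 0;
}
```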

We (1) discover the set of CDARs in multiple data streams (with good approximation), that satisfy the user-specified minimum support, minimum discriminative value and minimum confidence and (2) build a data streams classifier from the CDARs.

The rule mining process is time-consuming, and we have to deal with a large number of combinations in the multiple data streams. The proposed algorithm has to be time- and memory-efficient. The Apriori property of subsets defined for association rules is not valid for discriminative rules; that is, not every subset of a rule is discriminative. We also need to consider the generation of compact data structures containing all sets of rules so that they can be discovered, ranked, pruned and evaluated appropriately. Even in a non-streaming environment, this process is time-consuming. The process becomes more cumbersome for mining discriminative itemsets from data streams.

4 H-DAC method

Data streams are processed as continuous batches, each containing a different number of transactions depending on their speed. This is shown as \(B_{1}\), …, \(B_{h}\), \(B_{h + 1}\), …, \(B_{n}\), with \(B_{n}\) as the most recent one and \(B_{1}\) as the oldest one. Each batch is made of the multiple data streams \(\bigcup\nolimits_{i = 1}^{m} {S_{i} }\). We propose a method based on a novel in-memory prefix-tree structure, called H-DACStream, for monitoring all classification rules in multiple data streams using the tilted-time window model. We propose rule mining based on a modified H-DISSparse, rule pruning, and rule ranking based on rule precedence. H-DISSparse is modified to work on multiple data streams and to output the rules to H-DACStream and its built-in tilted-time windows. The rule precedence is presented in the following section.

4.1 The rule precedence

The best classifier should be made of the subset of right rules that gives the least number of errors. This involves evaluating all the possible rules, which is a combinatorial problem. Although the CDARs are a small subset of CARs, they can still be large in number. We use rule ranking heuristics to make a total order of the generated rules and select the best subset.

Definition

Having two rules \(r_{i}\) and \(r_{j}\), \(r_{i} > { }r_{j}\) (also called \(r_{i}\) precedes \(r_{j}\)) if (see the comparator sketch after this definition):

  1. The \(r_{i}\) confidence is greater than that of \(r_{j}\), or

  2. They have similar confidences, but the \(r_{i}\) discriminative level is greater than that of \(r_{j}\), or

  3. They have both similar confidences and discriminative levels, but the \(r_{i}\) support is greater than that of \(r_{j}\), or

  4. They have all similar confidences, discriminative levels and supports, but \(r_{i}\) is generated earlier than \(r_{j}\) (i.e., it has fewer attributes on its left side).
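This precedence definition translates directly into a comparator. The following C++ sketch (type and field names are ours, not from the paper) applies criteria 1–4 in order:

```cpp
// Candidate rule with the metrics named in the precedence definition
// (field names are ours).
struct Rule {
    double confidence;
    double discriminative;  // discriminative level
    double support;
    int    length;          // number of attributes on the rule's left side
};

// Returns true if rule a precedes rule b, applying criteria 1-4 in order.
bool precedes(const Rule& a, const Rule& b) {
    if (a.confidence != b.confidence)         return a.confidence > b.confidence;
    if (a.discriminative != b.discriminative) return a.discriminative > b.discriminative;
    if (a.support != b.support)               return a.support > b.support;
    return a.length < b.length;  // the shorter rule was generated earlier
}
```

Sorting a vector of such rules with std::sort and this comparator yields the total order used in the classifier format given below.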

We select the rules for rule ranking from all tilted-time window frames. Let \(R\) be the set of all the generated rules (i.e., CDARs) during the history of data streams, and \(\bigcup\nolimits_{i = 1}^{m} {S_{i} }\) the training data streams. The main idea of the algorithm is to select a set of high precedence rules in \(R\) to cover data streams. Our classifier is of the following format:

$$ < r_{1} ,{ }r_{2} , \ldots ,{ }r_{n} ,default\_class > , $$

where \(r_{i} \in R\) and \(r_{a} > { }r_{b}\) if \(b > { }a\); \(default\_class\) is the default class. For classifying any new case (i.e., a new transaction), two criteria are considered. First, we obtain every rule that satisfies the case. Second, and for better accuracy, we count the number of rules (i.e., a rule-counter) that satisfy the case (i.e., the cardinality in each class). Finally, the maximum value obtained from multiplying the cardinality of rules for each class by the rule precedence criteria classifies the case. In case no rule applies to the case, the default class is taken as in C4.5 (Quinlan 2014) and the AC methods (Ma 1998; Li et al. 2001; Yin and Han 2003; Thabtah et al. 2004). We propose an algorithm for building a classifier made of three steps, as in Algorithm 1.

Step 1: Discover the set of rules \(R\) based on a modified version of the H-DISSparse algorithm from multiple data streams. It ensures the rules with high accuracy are considered for our classifier (lines 1–2).

Step 2: Find all the matched rules with every new transaction \(Tr\) in the current batch and H-DACStream (lines 3–4).

Step 3: Based on relation >, select the highest precedence rule for our classifier and measure the accuracy using cross-validation (lines 5–6).

4.2 Rule mining algorithm

The modified H-DISSparse algorithm generates all the discriminative itemsets in a batch of transactions as training data streams. The method does not generate all itemsets; it uses two heuristics, as proposed in Seyfi et al. (2017), to efficiently mine only the potential discriminative itemsets, as follows.

During the itemset generation process, the two heuristics defined in the DISSparse algorithm eliminate many non-potential discriminative itemsets. These heuristics are applied before the itemset combination generation process. The principles of the FP-Growth algorithm (Han et al. 2000) are used for generating itemset combinations. However, the divide and conquer used in FP-Growth does not work in DISSparse, as the Apriori property does not hold for discriminative itemset mining (i.e., a subset of a discriminative itemset can be non-discriminative). In DISSparse, the itemset generation process is done incrementally by discovering the discriminative itemsets that end with specific items. The two heuristics are defined for avoiding the generation of non-potential discriminative itemsets. The first heuristic is defined on the whole set of itemsets starting with a specific item and ending with another specific item (i.e., a subtree in the conditional FP-Tree). The second heuristic is defined on the internal items that lie between itemsets starting with a specific item and ending with another specific item (i.e., internal nodes in a subtree in the conditional FP-Tree). The second heuristic depends on the first and is only applied if the first heuristic confirms that the set has the potential for generating a discriminative itemset. Using these two heuristics, DISSparse either skips the itemset combination generation for the whole set of itemsets starting with a specific item and ending with another specific item, or for part of its internal items. The DISSparse algorithm's heuristics were justified for their correctness and completeness (Seyfi et al. 2017, 2021a, b). Moreover, the algorithm efficiency gained from these heuristics was reported using a different set of experiments with large synthetic and real datasets.

The two heuristics proposed in DISSparse work based on only one data class (i.e., the target data class). They eliminate the itemsets that are non-potential discriminative in the target class vs the general class (i.e., a summary of all other classes). Here, we modify the heuristics to work for each stream vs all other streams instead of a single target stream vs the general stream. For this, we consider one stream as the target stream and then discover the discriminative itemsets in that stream vs all other streams (i.e., considered as one, called the general class). We then repeat the same process for each class, respectively.

The algorithm then updates the window model following the principles of Giannella et al. (2003). The CDARs do not follow the Apriori property defined for ACs, and a subset of a rule can be non-discriminative. Therefore, the tail pruning techniques proposed in FP-Stream (Giannella et al. 2003) are not applicable. Based on the properties of the discriminative itemsets in the tilted-time window model, three corollaries are proposed in Seyfi et al. (2021a, b). These guarantee the highest refined approximate support and approximate ratio bound in the tilted-time window model. At first, the algorithm discovers the discriminative itemsets, and then it adds each itemset as a new rule to the DAC-Tree.

DAC-Tree This is a prefix-tree structure similar to the one proposed in FP-Growth (Han et al. 2000) and is used for holding the discriminative itemsets of the data streams in the most concise way (i.e., by sharing the branches for their most common frequent items). Every branch starting from the root to a particular node is considered as the prefix of the itemsets ending at a node after that particular node. Some of the nodes may only be subsets of discriminative itemsets and do not represent rules; as the Apriori property does not hold, a subset of a discriminative itemset can be non-discriminative. Compared with FP-Growth, each rule node in the DAC-Tree has additional metrics, as it is associated with its multiple frequency counters, class label, confidence, discriminative value and support on the path starting from the root and ending at this node. Moreover, we hold the true-positive and true-negative counters for each rule (e.g., there are six metrics associated with rule nodes in the DAC-Tree, as in Fig. 2 based on Example 1). This prefix tree is used for cross-validation to select the optimum set of rules out of the current batch and the window model.
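A plausible node layout for such a tree is sketched below in C++ (field names are ours; the paper does not prescribe an implementation):

```cpp
#include <map>
#include <vector>

// Hypothetical DAC-Tree node layout (names are ours). A node marks a rule
// only when isRule is set; other nodes are mere prefixes of longer
// itemsets, since a subset of a discriminative itemset can be
// non-discriminative.
struct DACNode {
    std::map<int, DACNode*> children;  // shared prefixes, as in FP-Growth
    std::vector<long> streamCount;     // one frequency counter per data stream
    bool   isRule = false;
    int    classLabel = -1;            // right side of the rule (stream id)
    double confidence = 0, discriminative = 0, support = 0;
    long   truePositive = 0, trueNegative = 0;  // per-rule validation counters
};
```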

Fig. 2 DAC-Tree based on Example 1

For mining the optimum set of rules, we make another similar prefix tree called Optimum-DAC-Tree. This is generated out of the rules in each fold (of cross-validation) that have a true-positive percentage (i.e., \( \frac{TP}{{TP + TN}}*100\%\)) higher than the last data stream classification accuracy (i.e., not including the current batch). In the Optimum-DAC-Tree, we also hold the number of folds in which the rule appeared with accuracy higher than the last data stream classification accuracy. This avoids overfitting by choosing the rules, with good accuracy, that appeared in several mining folds (e.g., at least 20% of the folds based on our experiments). We update the window model with the CDARs which have a positive impact on the classifier accuracy. At the end of the k-fold cross-validation, the rules in the Optimum-DAC-Tree are sent to another prefix-tree structure, called H-DACStream, as defined below.

H-DACStream This is a similar structure to the DAC-Tree, but it has a built-in tilted-time window model for holding the discovered discriminative and sub-discriminative rules as well. After mining the best set of rules matching the latest batch of transactions in the data streams, the rules are transferred to this prefix tree (i.e., \(W_{0}\)), and the window model is shifted and merged accordingly (i.e., \(W_{k}\), \(k > 0\)). Each node in the H-DACStream has multiple counters \(C_{i} \left( I \right)\) (i.e., \(1 \le i \le m\)) for holding the itemset frequencies in each of the data streams, respectively, in the current window frame \(W_{0}\). Each H-DACStream node may have a built-in tilted-time window frame if the rule is discriminative in a larger window frame \(W_{k}\) (i.e., \(k > 0\)), sub-discriminative in the historical summary of the window frames, or appears as a non-discriminative subset of discriminative or sub-discriminative itemsets in any window frame. As in the DAC-Tree, each rule node in the H-DACStream is associated with its multiple frequency counters, class label, confidence, discriminative value and support on the path starting from the root and ending at this node (e.g., there are four metrics associated with rule nodes in the H-DACStream, as in Fig. 3 based on Example 1).
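An H-DACStream node can be sketched as a DAC-Tree style node extended with an optional built-in tilted-time window (again a hypothetical layout, with our naming):

```cpp
#include <map>
#include <vector>

// Hypothetical H-DACStream node: a DAC-Tree style node plus an optional
// built-in tilted-time window. frames[k][i] holds the itemset frequency
// in stream i for window frame W_k (k > 0); it stays empty for rules
// that are only kept for the current frame W_0.
struct HDACStreamNode {
    std::map<int, HDACStreamNode*> children;
    std::vector<long> streamCount;           // C_i(I) in the current frame W_0
    bool   isRule = false;
    int    classLabel = -1;
    double confidence = 0, discriminative = 0, support = 0;
    std::vector<std::vector<long>> frames;   // built-in tilted-time window
};
```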

Fig. 3 H-DACStream based on Example 1

There are three corollaries proposed in the H-DISSparse method (Seyfi et al. 2021a, b) for mining the highest refined approximate bounds of discriminative itemsets in the tilted-time window model. The first corollary says the method holds the exact frequencies of itemsets from the time they are maintained in the tilted-time window model. This is done by mining the exact discriminative itemsets in the current batch \(W_{0}\) based on the DISSparse method (Seyfi et al. 2017), and tuning the non-discriminative itemsets that stay as internal nodes in the current window frame (i.e., subsets of discriminative itemsets), obtained by traversing the FP-Tree (i.e., holding all transactions in the current batch) for their exact appearances in the current batch of transactions. The second corollary modifies the first corollary using the relaxation \(\alpha\), so that a set of non-discriminative itemsets in the current batch is held in \(W_{0}\) as discovered sub-discriminative itemsets. The sub-discriminative itemsets in the current window frame \(W_{0}\) and the history of data streams are discovered for better approximation of itemset frequencies and itemset frequency ratios. The third corollary is used for tail pruning by tagging the non-discriminative itemsets that stay as leaf nodes to be deleted from the tilted-time window model for space saving.

In our algorithm, the rule mining is based on offline batch processing. At first, the training set of each fold of cross-validation is processed, and the set of rules in the most recent batch is mined into the DAC-Tree with their data stream frequencies and four other metrics each. After that, the test dataset of each fold of cross-validation is read transaction by transaction. For each new transaction, using a recursive function, the DAC-Tree of the current fold and the rules in the H-DACStream are traversed. All rules matched with the new transaction are collected and then ranked based on another function. Every new transaction is classified with its class label, and the process continues by measuring classification accuracy using cross-validation. The total number of errors made by the classifier is recorded. This shows the sum of the errors caused by all the selected rules in the classifier. During the cross-validation, the best rules in the current fold, accompanying the rules in the tilted-time window model, are selected. By the end of cross-validation, the classifier's accuracy is reported. The rules in the best DAC-Tree (i.e., belonging to the fold with the highest accuracy) are used for updating the H-DACStream and its tilted-time window model. The main challenge is setting the right set of parameters for mining more accurate rules for different data streams. This will be discussed in the empirical evaluation in Sect. 5. The algorithm is given in Algorithm 1.

Algorithm 1 The H-DAC algorithm

In the first step, the H-DAC algorithm scans the training set of the current batch \(B\) and runs the modified H-DISSparse to generate all discriminative itemsets in each stream vs all other streams in the tilted-time window model. In the second step, it builds the DAC-Tree by adding the non-redundant, confident and frequent rules with their class label, discriminative value, confidence and support. In the third step, it scans every new transaction \(Tr\) in the test set of the current batch \(B\) and discovers all the matched rules in the DAC-Tree, using Algorithm 2 DACTraverse. In the fourth step, it scans every new transaction \(Tr\) in the test set of the current batch \(B\) and discovers all the matched rules in the H-DACStream, using Algorithm 2 DACTraverse. In the fifth step, it ranks the rules based on confidence, discriminative value and support, using Algorithm 3 H-DACStreamRanking, and selects the class label for the new transaction \(Tr\). The sixth step measures the H-DAC accuracy using cross-validation. Finally, in the seventh step, it returns all \({\text{CDARs}}\) (i.e., \(DI_{{i\bigcup\nolimits_{j \ne i}^{m} j }}^{k}\)) for each window frame \(W_{k}\) and the H-DAC accuracy.

For the first batch (i.e., \(B_{1}\)), the method takes two scans of the datasets (a DISSparse requirement). This is for making the concise data structures and achieving faster processing time, in which items are ordered by decreasing frequency as in Han et al. (2000). The significant part attracting considerable complexity is related to the DISSparse algorithm generating the potential discriminative itemsets. Updating the tilted-time window model by shifting and merging, tuning the frequencies of the non-discriminative subsets and applying the tail pruning in the H-DACStream structure have less complexity, considering the sparsity property of CDARs. In H-DAC, the tilted-time window model is updated only after finding the DAC-Tree with the best accuracy out of the current batch of transactions and the window model. The efficiency of the H-DAC algorithm is discussed in detail by evaluating the algorithm with different input data streams in the experiments. Empirical analysis shows the efficiency and effectiveness of the proposed method by testing with different parameter settings (e.g., discriminative level \(\theta\), support threshold \(\varphi\), relaxation \(\alpha\)). There are two functions used in the above code for DAC-Tree and H-DACStream traversing, and for rule ranking, respectively. The rule pruning is also done by adding only the non-redundant, confident and frequent rules to the structures.

Example 2

Consider a rule classification scenario in multiple market basket data streams. Each data stream shows the transactions (i.e., items purchased by one customer) in a separate market. Each CDAR indicates a group of items bought together (i.e., the left side of the rule) in a market (i.e., the right side of the rule). In the beginning, the training set in the first batch of multiple data stream transactions is processed and discriminative rules are discovered. Then, the discovered discriminative rules are checked for being non-redundant, confident and frequent before being added to the DAC-Tree. Next, the test set in the first batch of multiple data stream transactions is processed transaction by transaction to be matched with the rules in the DAC-Tree. This process is repeated for the matched rules in the H-DACStream. After that, the matched rules are ranked based on rule precedence and the new transactions (i.e., in the test dataset) are classified with their class label (i.e., their market). These training and test set procedures are repeated using cross-validation for measuring the classification accuracy. In the end, all the recent CDARs are set to the latest tilted window frame, the tilted-time window model is shifted and merged with its older CDARs accordingly, and the process continues with the new batch of transactions from the multiple data streams (i.e., multiple markets).

4.3 Rule pruning

Compared to traditional association rules, the discriminative association rules are sparse and smaller in number. Every discriminative association rule has to be frequent first and then has to have a higher ratio in the target data stream vs the rest of the data streams. The second condition prunes many non-discriminative association rules. Although the discriminative itemsets are a small subset of frequent itemsets, the H-DAC algorithm still discovers a large set of rules, because classification data are typically highly correlated. We prevent the rules that are either redundant or misleading from taking any role in the prediction process of test data objects, making the classification process more effective and accurate.

At first, we omit the long rules by limiting the length of the generated rules, to avoid noise in data streams. The longer rules are susceptible to noise in the datasets (i.e., they lead to overfitting). Second, we use the minimum confidence \(c\) and minimum support \(s\) to prune the rules. Third, the redundant rules are pruned before being added to the DAC-Tree (a sketch of this check is given below). All attribute value combinations are considered in turn as a rule's condition. Therefore, rules in the resulting classifier may share training items in their conditions, and for this reason, there could be several specific rules containing many general rules. The formal definition of rule redundancy is based on the substitution of a rule by one with smaller length (i.e., a general rule) that has higher or equal confidence compared with the rules of larger lengths (i.e., those sharing items with the general rule). During the pruning, we discard all the specific rules with confidence less than or equal to that of the general rules. Before adding the rules to the DAC-Tree, we prune all rules such as \(I^{\prime} \to c\), where there is some general rule \(I \to c\) of a higher rank and \(I \subseteq I^{\prime}\). Fourth, we prune the rules before updating the tilted-time window model by choosing the optimum set of rules in each batch using the Optimum-DAC-Tree. After rule pruning, we traverse the DAC-Tree and H-DACStream, respectively, and find the set of all rules that classify the new transaction. Then, we do the rule ranking based on rule precedence, as follows.
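The redundancy check of the third step can be sketched as follows (itemsets as sorted integer vectors; naming is ours): a specific rule is discarded when a shorter general rule with the same class label, contained in it, has equal or higher confidence.

```cpp
#include <algorithm>
#include <vector>

struct CandidateRule {
    std::vector<int> items;   // left side of the rule, kept sorted
    int    classLabel;
    double confidence;
};

// True if every item of general appears in specific (I subset of I').
bool isSubset(const std::vector<int>& general, const std::vector<int>& specific) {
    return std::includes(specific.begin(), specific.end(),
                         general.begin(), general.end());
}

// Drop specific rules dominated by a shorter, at-least-as-confident
// general rule with the same class label (sketch of Sect. 4.3, step 3).
void pruneRedundant(std::vector<CandidateRule>& rules) {
    std::vector<CandidateRule> kept;
    for (const auto& r : rules) {
        bool dominated = false;
        for (const auto& g : rules)
            if (g.items.size() < r.items.size() && g.classLabel == r.classLabel &&
                g.confidence >= r.confidence && isSubset(g.items, r.items)) {
                dominated = true; break;
            }
        if (!dominated) kept.push_back(r);
    }
    rules.swap(kept);
}
```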

4.4 Rule ranking

A recursive function is defined for rule traversing. This function takes \(Tr\), \(Tr\_len\) and the \(root\) of the DAC-Tree or H-DACStream, respectively, as inputs, and then fills a stack with all the rules matched with the new \(Tr\), as output, in either tree. DACTraverse traverses the DAC-Tree or H-DACStream using all possible \(Tr\) subsets starting with any of the \(Tr\) items. During the recursive traversing, whenever it reaches a rule node (i.e., one with the four rule metrics), the rule is added to the stack, and the traversal continues with the other \(Tr\) subsets. To avoid traversing the same rule more than once, we negate the traversed \(Tr\) subsets. The recursive function then checks the \(Tr\) items before calling itself for a \(Tr\) of lower length. The DACTraverse algorithm is given in Algorithm 2.

Algorithm 2 DACTraverse
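A simplified version of this traversal is sketched below (naming is ours; the real Algorithm 2 negates traversed subsets, which the position argument makes unnecessary in this simplification). For a transaction sorted in tree order, every stored itemset matched by some subset of \(Tr\) corresponds to a strictly increasing path, so it suffices to recurse on the remaining items:

```cpp
#include <map>
#include <vector>

struct Node {                        // minimal stand-in for a DAC-Tree node
    std::map<int, Node*> children;
    bool isRule = false;
};

// Collect every rule node reachable by some subset of transaction tr
// (tr sorted ascending); pos is the index of the first tr item still usable.
void dacTraverse(const Node* node, const std::vector<int>& tr,
                 std::size_t pos, std::vector<const Node*>& matched) {
    if (node->isRule) matched.push_back(node);
    for (std::size_t p = pos; p < tr.size(); ++p) {
        auto it = node->children.find(tr[p]);
        if (it != node->children.end())
            dacTraverse(it->second, tr, p + 1, matched);  // use remaining items
    }
}
```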

The prefix trees used for saving the rules are small, and traversing them, specifically using a recursive function, is not time-consuming. The stack is filled with the true set of rules matched with each new transaction using the traversing function. The rules may belong to different streams. Each rule has its class label (stream identifier), confidence, discriminative value and support. The H-DACStreamTraverse goes through each of the tilted-time windows and adds the rules with the highest precedence to the same stack. That is, it may find several rules in different window frames with the same itemset in their rule conditions but different class labels (i.e., data streams \(S_{i}\)). The algorithm chooses the high-precedence rule based on rule precedence. After filling the stack with all the rules matched with the transaction, rule ranking is done based on rule precedence for predicting the right label for the transaction.

From the rules with the same class label (i.e., data stream \(S_{i}\), \(1 \le i \le m\)), the one with higher precedence is selected. This rule nominates the class label. To avoid overfitting, we use a rule-counter for each class and then multiply the rule metrics (i.e., confidence, discriminative value and support) by this counter. The rule-counter for the rules in the DAC-Tree is incremented by 1. The rule-counter for the rules in the H-DACStream (i.e., the current batch \(W_{0}\)) and the rules in other window frames (i.e., \(W_{k}\), \(k > 0\)) is also incremented by 1. However, for the rules in window frames \(W_{k}\) (i.e., \(k > 0\)), it is further incremented by the number of window frames in which that rule appeared. Following this, the highest precedence rule in a class with a higher number of rules is ranked higher. If two rules have the same metrics, the algorithm chooses the one with the shortest length (i.e., generated earlier). In case no rule matches the transaction, the default class is chosen. We consider the class with the highest number of rules as the default class. After classifying each transaction in the test set of the data streams, the H-DAC algorithm reads the next \(Tr\), and the process continues for the rest of the test data streams. The rule ranking algorithm is given in Algorithm 3.

Algorithm 3 H-DACStreamRanking
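The class selection can be sketched as follows (a loose reading of Algorithm 3, with our naming; for brevity only confidence is used as the precedence metric being multiplied by the rule-counter):

```cpp
#include <algorithm>
#include <map>
#include <vector>

struct MatchedRule {
    int    classLabel;
    double confidence;
    int    windowFrames;  // frames W_k (k > 0) the rule appeared in, else 0
};

// Pick the class label for one transaction from its matched rules;
// defaultClass is used when nothing matched.
int classify(const std::vector<MatchedRule>& matched, int defaultClass) {
    if (matched.empty()) return defaultClass;
    std::map<int, double> counter, best;  // per-class rule-counter / best confidence
    for (const auto& r : matched) {
        counter[r.classLabel] += 1 + r.windowFrames;  // +1 per rule, plus history
        best[r.classLabel] = std::max(best[r.classLabel], r.confidence);
    }
    int winner = defaultClass;
    double top = -1;
    for (const auto& kv : best) {
        double score = kv.second * counter[kv.first];  // metric x cardinality
        if (score > top) { top = score; winner = kv.first; }
    }
    return winner;
}
```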

The algorithm checks all the matched rules with transaction \(Tr\) using a nested loop. The rules in each class are checked only once, and the one with the highest precedence and highest repetition is chosen. This is done for each transaction in the test data streams based on cross-validation. The accuracy of the built classifier is high, and the algorithm is very efficient, especially for large data streams.

Our framework helps to address the understandability of the rules in classification rule mining. We use discriminative itemsets that show the data stream differences. These are a small subset of class association rules and are usually shorter in length. In addition, to make a reliable and accurate prediction, the most confident rules may not always be the best choice. We use rule precedence based on itemsets that are all discriminative; however, confidence still precedes the discriminative value of the rules. To avoid overfitting, we use a rule-counter and multiply the rule metrics by this counter.

4.5 Quality rules

Understandability is a key issue in rule-based classification. In many classification techniques, domain-independent biases are used to generate rules and form a classifier. This results in many rules that make no sense to the user. We work on a classification rule mining framework that helps to solve the understandability of the rules. In our method, we use discriminative itemsets that show the class differences. Many rules are obvious, as they show the discrimination inherent in the data streams. In every specific domain, these rules are easily understandable by users. Moreover, the number of mined rules and the sometimes biased classification or overfitting in associative classification techniques are cumbersome. The discriminative itemsets are a sparse subset of class association rules and are usually shorter in length. The advantage of the proposed classifier is mostly shown in large-scale data streams. This is the superiority of our classification technique over other classifiers, which fail or take too long during rule mining. We achieve this by applying several pruning heuristics both during the rule mining and during the rule ranking processes. The rationale behind our pruning strategy is to only hold the rules reflecting strong implications for classification. The rules that are not positively correlated are pruned, as they seem to be noise. Moreover, the redundant rules that have greater length and lower or equal confidence compared with the general rules are pruned. In addition, the most confident rules may not always be the best choice for making a reliable and accurate prediction. In our method, we defined the precedence of the rules based on itemsets that are all discriminative. However, confidence still precedes the discriminative value of the rules. At first, in the case of higher confidence, the discriminative value is not considered. However, in the case of multiple rules with similar confidences, the rule with a higher discriminative value is considered. In our algorithm, the cardinality of rules is also considered, and the algorithm chooses the highest precedence rule in the class that has the higher cardinality of rules. By this, we avoid overfitting by preventing rule ranking based solely on rule precedence.

5 Empirical evaluation

In this section, we evaluate our H-DAC classifier in terms of accuracy, efficiency, scalability and sensitivity by performing an extensive experimental performance study using two real datasets from the UCI ML Repository (Dua and Graff 2019). We run our algorithm using different discriminative values \(\theta\) and support thresholds \(\varphi\). All the algorithms were implemented in C++, and the experiments were conducted on a desktop computer with an Intel Core 2 Duo T6600 2.2 GHz CPU and 4 GB main memory running 64-bit Microsoft Windows 7 Enterprise. In our baseline datasets, attributes can be categorical or continuous. For a categorical attribute, we assume that all the possible values are mapped to a set of consecutive positive integers. For a continuous attribute, we assume that its value range is discretized into intervals, and the intervals are also mapped to consecutive positive integers. Discretization of continuous attributes is done using the entropy method (Fayyad and Irani 1993). By doing so, all the attributes are treated uniformly in this study. Tenfold cross-validation is used for every data stream.
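As an illustration of this uniform encoding (a sketch only; the actual cut points for continuous attributes come from the entropy method of Fayyad and Irani (1993), which is not reproduced here):

```cpp
#include <map>
#include <string>
#include <vector>

// Map categorical values of one attribute to consecutive positive integers.
int encodeCategorical(std::map<std::string, int>& dict, const std::string& v) {
    auto it = dict.find(v);
    if (it != dict.end()) return it->second;
    int id = (int)dict.size() + 1;  // 1-based, consecutive
    dict[v] = id;
    return id;
}

// Map a continuous value to the 1-based index of its interval, given
// ascending cut points (assumed precomputed by entropy discretization).
int encodeContinuous(const std::vector<double>& cuts, double x) {
    int idx = 1;
    for (double c : cuts) { if (x >= c) ++idx; else break; }
    return idx;
}
```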

Statistically, H-DAC shows high classification accuracy in data streams which are not pessimistic. For clarification, pessimistic data streams are the ones with the majority of the instances labeled as one specific class (e.g., more than 90%), and the rest of the instances labeled as the other class(es). In these data streams, our algorithm reports an accuracy slightly better than majority-class classification (i.e., classifying all the instances into the majority stream). In the pessimistic datasets, the class distribution of the instances is itself discriminative, caused by the rule precedence of the majority stream. Although we could add exception handling based on the data stream characteristics (i.e., class distribution), we leave the algorithm in its general form.

There are two important thresholds for discriminative itemsets in H-DAC: the support threshold \(\varphi\) and the ratio \(\theta\). As discussed before, these two thresholds control the number of rules selected for classification. In general, if the set of rules is too small, some effective rules may be missed. On the other hand, if the rule set is too large, the training data stream may be overfitted. Thus, we need to test the sensitivity of the thresholds w.r.t. classification accuracy.

To obtain the best results from the H-DAC algorithm, we select the best configuration for each dataset. The parameters are tuned using a greedy method. For setting any of the parameters, all the other parameters (i.e., support threshold, ratio, and the less important but still effective parameters including rule length, confidence and support) are held fixed, and we only change one of them to find the best reported accuracy. The same process is followed for tuning the other parameters. This greedy method is repeated a couple more times to achieve a possibly better accuracy. Empirically, the algorithm is fast and efficient, parameter tuning for the classifier does not take much time, and it usually reaches the best setting in the first or second iteration.
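This tuning procedure amounts to a coordinate-wise greedy search, sketched below; evaluate() is a hypothetical hook standing for one full cross-validated H-DAC run, not part of the paper:

```cpp
#include <vector>

// One cross-validated H-DAC run with the given parameter vector;
// hypothetical hook standing in for the real pipeline.
double evaluate(const std::vector<double>& params);

// Coordinate-wise greedy tuning: hold all parameters fixed, sweep one
// over its candidate values, keep the best, move on; repeat a few passes.
std::vector<double> tuneGreedy(std::vector<double> params,
                               const std::vector<std::vector<double>>& candidates,
                               int passes = 2) {
    double bestAcc = evaluate(params);
    for (int pass = 0; pass < passes; ++pass)
        for (std::size_t i = 0; i < params.size(); ++i)
            for (double v : candidates[i]) {
                double old = params[i];
                params[i] = v;
                double acc = evaluate(params);
                if (acc > bestAcc) bestAcc = acc;  // keep the improvement
                else params[i] = old;              // otherwise revert
            }
    return params;
}
```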

5.1 Batch processing

Mining CDARs as a small subset of CARs in the H-DAC algorithm is very efficient for large datasets. After an extensive search we chose two datasets, Adult and Susy, from the UCI repository (Dua and Graff 2019) to test our algorithm. Adult is a relatively small dataset and is simulated as a small data stream, whereas Susy is much larger and is simulated as a large data stream. Unfortunately, we could not find any freely available real transaction-based multiple data streams (i.e., containing more than one class label in the same dataset) suitable for evaluating our classifier. However, in another research work (i.e., the DAC method, as explained in Sect. 5.2) we compare extensively against the state-of-the-art on different real datasets. Here, we mainly intend to show the usefulness of the tilted-time window model for improving H-DAC classifier accuracy over the history of the data streams. At first, we ran the experiments on a single batch of transactions only, using different parameter settings to find the best values. As discussed in the previous section, the support threshold \(\varphi\) and ratio \(\theta\) control the number of rules selected for classification, so we test their sensitivity w.r.t. classification accuracy.

5.1.1 A single batch of the Adult dataset

The Adult dataset consists of 48,842 instances, where the first column is the class label followed by fourteen features. We selected the first 1526 instances (i.e., the dataset size divided by 32, about 3% of the dataset) for the scale of the experiments. As an example, we test different support threshold \(\varphi\) and ratio \(\theta\) values on this batch of the Adult dataset. As explained in the previous section, we apply a greedy method for setting the algorithm parameters: we run the algorithm in multiple iterations and, in each iteration, fix all the parameters except the desired one, changing it until we reach the best result; we then repeat the iterations for the next parameter until all the parameters are set to their best values. The ratio \(\theta\) is set to 4 with different support thresholds. Moreover, we test different ratio \(\theta\) values on this dataset with \(\varphi\) set to 0.015. We observed the best accuracy of 82.9% on this dataset for \(\theta = 4\) and \(\varphi = 0.015\). We noticed the best accuracy with a rule length limit of less than 6, which improves the performance of the algorithm as well. Changing these parameters lowers the reported accuracy, although it may be in favor of lower time and space usage when the optimal parameter setting leads to high time and space complexity. The accuracy results are shown in Fig. 4.

Fig. 4 The effect of support threshold \(\varphi\) and ratio \(\theta\) on the accuracy of the Adult dataset

From the above figure, we can see that there are optimal settings for both thresholds. However, according to our experimental results, there seems to be no way to pre-determine the best threshold values. Fortunately, both curves are quite flat, which means the accuracy is not very sensitive to the two threshold values. The time and space complexity are shown in Figs. 5 and 6, respectively.

Fig. 5 The time and space complexity of H-DAC with different support threshold \(\varphi\) on the Adult dataset and \(\theta = 4\)

Fig. 6 The time and space complexity of H-DAC with different ratios \(\theta\) on the Adult dataset and \(\varphi = 0.015\)

5.1.2 A single batch of the Susy dataset

The Susy dataset consists of five million instances, where the first column is the class label followed by eighteen features. We selected the first fifty thousand instances (i.e., 1% of the dataset) for the scale of the experiments. We test different support threshold \(\varphi\) and ratio \(\theta\) values on this batch of the Susy dataset. The ratio \(\theta\) is set to 1.5 with different support thresholds. Moreover, we test different ratio \(\theta\) values on this dataset with \(\varphi\) set to 0.0325. We observed the best accuracy of 69.67% on this dataset for \(\theta = 1.5\) and \(\varphi = 0.0325\), with a rule length limit of less than 4. The accuracy results are shown in Fig. 7.

Fig. 7 The effect of support threshold \(\varphi\) and ratio \(\theta\) on the accuracy of the Susy dataset

The time and space complexity are shown in Figs. 8 and 9, respectively.

Fig. 8 The time and space complexity of H-DAC with different support threshold \(\varphi\) on the Susy dataset and \(\theta = 1.5\)

Fig. 9 The time and space complexity of H-DAC with different ratios \(\theta\) on the Susy dataset and \(\varphi = 0.0325\)

The H-DAC algorithm is fast, and a great amount of its time complexity is consumed by the cross-validation process. It is clear from the above figures that our algorithm is capable of classifying small and large datasets with reasonable time and space usage and good accuracy. We observed linear growth in the algorithm's time and space usage as the size of the datasets increases (e.g., we repeated the experiments on the Susy dataset with several smaller and larger sizes). As can be seen from the scalability experiments, a greater number of rules (obtained with smaller \(\theta\) and \(\varphi\)) led to lower accuracy, because overfitting arises from longer rules with lower support and discriminative values. Therefore, the best parameter setting is usually somewhere in between, which also leads to acceptable time and space usage. The main advantage of our algorithm is its use of discriminative itemsets: every selected rule is discriminative, so many unnecessary rules are either never generated during rule mining or are simply pruned during rule pruning and rule ranking. The discriminative value of the discovered rules is the second property for rule ranking, after rule confidence. In the following, we test the H-DAC algorithm in data stream classification using the tilted-time window model.

5.2 Tilted-time window processing

The Susy and Adult datasets are each modeled as 32 continuous batches of equal size (i.e., for the sake of clarity), with the instances of the two class labels treated as data stream \(S_{1}\) and data stream \(S_{2}\), respectively. The ratio between the sizes of \(S_{1}\) and \(S_{2}\) (i.e., \(n_{2}/n_{1}\)) is also the same for all 32 batches. The Adult data stream is made of 32 continuous batches of about 1526 records each, belonging to two class labels (i.e., considered as two data streams), for the scale of the experiments. The Susy data stream is likewise made of 32 continuous batches of 50,000 records each, belonging to two data streams. The scalability of H-DAC is presented with offline updating of the tilted-time window model after processing each batch of transactions.

It is assumed that while the new batch of transactions is being loaded, the H-DACStream update can be done by processing the current batch of transactions. This works well as long as the algorithm is faster than the arrival rate of the incoming data streams. Moreover, it is assumed that a portion of the instances in each incoming batch is labeled. Our algorithm trains and tests based on these labeled transactions, and updates the tilted-time window model based on the CDARs in the optimum DAC-Tree. It then predicts the class labels of the remaining unlabeled instances in the batch. The number of rules in the batches (i.e., presented in \(W_{k}\), \(k \ge 0\)) differs because of the distributions of the transactions; the embedded knowledge and the trends in data streams change over time as the concept drifts. As mentioned earlier, our proposed H-DAC method is the first algorithm working on multiple data stream classification. As a result, the only proper way to evaluate our method is to use the batch-based classifier as the baseline. We have extensively evaluated our batch-based classifier (i.e., the DAC method, which works only on static datasets) against the state-of-the-art in another research work (i.e., the DAC paper is currently under review at another journal). There, we compared the DAC method with several other rule-based classifiers including C4.5, CBA, CMAR, CPAR and L3. In the DAC research work, the comparisons were done on different types of datasets (26 datasets in total, from different domains) with different numbers of tuples, columns and class labels.
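For clarity, the per-batch flow described above can be outlined as follows; all the types and function bodies here are illustrative placeholders, not the paper's actual API:

    #include <vector>

    // Illustrative placeholders for one transaction and one batch.
    struct Transaction { std::vector<int> items; int label; bool labeled; };
    using Batch = std::vector<Transaction>;

    // Stubs standing in for the real mining, updating and prediction steps.
    void trainAndTest(const Batch&) {}            // mine CDARs from labeled part
    void updateTiltedTimeWindows(const Batch&) {} // offline H-DACStream update
    void predictUnlabeled(Batch&) {}              // classify unlabeled instances

    void processStream(std::vector<Batch>& batches) {
        for (Batch& batch : batches) {
            trainAndTest(batch);            // 1. train/test on labeled instances
            updateTiltedTimeWindows(batch); // 2. shift/merge window frames
            predictUnlabeled(batch);        // 3. predict unlabeled instances
        }
    }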

5.2.1 Adult data stream

In this section, the discriminative level \(\theta = 4\) and support threshold \(\varphi = 1.5\%\) are used in all experiments, and the scalability of the algorithm is tested with different relaxations of \(\alpha\). Here, we compare the accuracy of H-DAC using the tilted-time window model with batch processing only (i.e., no history in the data streams), as shown in Fig. 10.

Fig. 10 The accuracy of the Adult data stream using the tilted-time window model

We see improvements in classification accuracy (i.e., between 0 and 2%) using the tilted-time window model in most of the batches, and we expect further improvements with a greater number of incoming batches. The scalability of the H-DAC algorithm for mining CDARs in data streams using the tilted-time window model is presented in Fig. 11.

Fig. 11 The time complexity of the H-DAC algorithm

The size of H-DACStream, the biggest data structure in the designed algorithm, is presented in Fig. 12. Following the compact logarithmic tilted-time window model and applying tail pruning, the H-DACStream size stays small as an in-memory data structure. The periodic drops in the size of the structure are caused by merging the tilted-time window frames.
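As one plausible reading of this merging behavior (a sketch, not the paper's H-DACStream code), the following C++ fragment keeps at most two frames per granularity level and, when a third arrives, merges the two oldest frames and carries the merged count to the next, coarser level, mirroring the periodic drops seen in the figure:

    #include <cstddef>
    #include <vector>

    // Keeps at most two frames per granularity level; a third frame forces
    // the two oldest to merge and carry upward. Counts here are simple
    // frequencies; this is a reading of the model, not the actual code.
    class TiltedTimeWindow {
        std::vector<std::vector<long>> levels_; // levels_[i] holds <= 2 frames
    public:
        void addBatch(long count) { carry(0, count); }
    private:
        void carry(std::size_t level, long count) {
            if (levels_.size() <= level) levels_.resize(level + 1);
            levels_[level].push_back(count);
            if (levels_[level].size() > 2) {
                long merged = levels_[level][0] + levels_[level][1];
                levels_[level].erase(levels_[level].begin(),
                                     levels_[level].begin() + 2);
                carry(level + 1, merged); // promote to the coarser level
            }
        }
    };

Tail pruning then simply discards the coarsest frames whose counts fall below the relaxed threshold.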

Fig. 12
figure 12

H-DACStream structure size

In this section, the scalability of the algorithm, as a highly accurate and highly efficient method for mining CDARs in data streams using the tilted-time window model, is evaluated with different relaxations of \(\alpha\) for the highest refined approximate bound. The accuracy results are shown in Fig. 13.

Fig. 13 The accuracy of the Adult data stream using the tilted-time window model with the relaxation of \(\alpha = 0.9\)

The H-DAC time usage and the H-DACStream size are presented in Fig. 14 under different settings of the relaxation \(\alpha\), i.e., \(\alpha = 1\) and \(\alpha = 0.9\). H-DAC shows improvements in a few batches with the relaxation of \(\alpha = 0.9\) (i.e., with improvement in the approximate CDARs), compared with the relaxation of \(\alpha = 1\).

Fig. 14 Scalability of the H-DAC algorithm with relaxation of \(\alpha = 1\) and \(\alpha = 0.9\)

5.2.2 Susy data stream

In this section, the discriminative level \(\theta = 1.5\) and support threshold \(\varphi = 3.25\%\) are used in all experiments, and the scalability of the algorithm is tested with different relaxations of \(\alpha\). The accuracy results are shown in Fig. 15.

Fig. 15 The accuracy of the Susy data stream using the tilted-time window model

The scalability of the H-DAC algorithm for mining CDARs in data streams using the tilted-time window model is presented in Fig. 16, and the H-DACStream size is presented in Fig. 17.

Fig. 16 The time complexity of the H-DAC algorithm

Fig. 17 H-DACStream structure size

The H-DAC algorithm is scalable for both small and large data streams. It further improves the classification accuracy based on the discriminative rules (i.e., CDARs) discovered during the processing of the latest batch of transactions in the multiple data streams, and those discovered during tilted-time window model updating. We set the discriminative level \(\theta\) and support threshold \(\varphi\) based on their proper values for the first batch of transactions in the data streams. Choosing a proper rule length is also important, both for the classifier's accuracy and for its scalability; less important but still effective are the rule confidence and support. All of these settings remain the same during data stream processing in the tilted-time window model. The most important advantage of the H-DAC algorithm in data stream processing, compared to batch processing, is the improvement in the classifier's accuracy with very little overhead for window model shifting and merging. In our two experiments, we limited the number of batches to 32; we expect better accuracy with an increasing number of batches over the history of the multiple data streams.

5.3 Discussion

Here we explain the general outcome of the experiments on the H-DAC algorithm discussed in the previous subsections. The H-DAC algorithm exhibits good improvements in accuracy using the tilted-time window model, compared with batch processing (i.e., not considering the data streams' history). Moreover, it shows efficient time and space complexity for mining CDARs using the tilted-time window model. The highest refined approximate bound on discriminative itemsets in the tilted-time window model is obtained efficiently based on the three corollaries proposed in Seyfi et al. (2021a, b). Setting the relaxation of \(\alpha\) in combination with the other parameters (i.e., support threshold \(\varphi\), discriminative level \(\theta\) and input batch size \(n_{1}\)) is very important in real applications. A proper size has to be chosen for the current window frame (i.e., \(W_{0}\)) so that the tilted-time window model can be updated at reasonable time intervals. This depends strongly on the application and on domain experts, who must consider the limited computing and storage capabilities and the approximate bound on the false-positive discriminative itemsets.

Trend changes caused by concept drift in the batches are neutralized quickly in the tilted-time window model, and the in-memory H-DACStream structure is maintained efficiently during the lifetime of the data streams. This remains scalable with tail pruning in the compact logarithmic tilted-time window model. The main part of the time and space complexity of the algorithm is related to batch processing, although the tail pruning of non-discriminative rules that appeared in old batches can add complexity as well.

The most appealing feature of the proposed method is that it uses a smaller number of rules, which makes the algorithm efficient in its time and space usage. Although H-DAC performs well in terms of accuracy, the main purpose of presenting this method is its scalability (and, at the same time, its understandability) for the classification of large-scale data streams.

6 Conclusion and future work

In this paper, we proposed an efficient single-pass algorithm for finding class discriminative association rules in multiple data streams based on the tilted-time window model. These rules define each stream distinctly in comparison to the rest of the streams in different time periods. The proposed algorithm introduces an in-memory prefix-tree structure, H-DACStream, and extracts a set of rules from the data streams with good approximation. The historical data structures generated during the process fit into main memory. The classifier is built from the highest-precedence rules with the highest repetition. The proposed method has been extensively evaluated with data streams exhibiting distinct characteristics, and the results show that it is realistic and effective for fast-growing data streams. Here, the discriminative itemsets are updated at offline time intervals in the tilted-time window model. In future work, we propose developing the algorithm for mining class discriminative association rules in multiple data streams using the sliding window model.