1 Introduction

Learning from data streams is one of the most rapidly developing fields in contemporary machine learning (Sun 2008; Ditzler et al. 2015). It is motivated by a plethora of real-world applications in which data arrives continuously and floods the system. This calls for new algorithms that are able to handle the ever-growing data volume and constantly update their structure within time and resource limits. Additionally, data streams may be subject to changes over time, a phenomenon known as concept drift (Gama et al. 2014). Such a change point must be detected as soon as possible in order to handle the drift appropriately and allow for fast recovery of the system. Data streams are strongly connected with the recently emerging paradigm of continual learning, where it is assumed that the machine learning model must be capable of continuous self-improvement and accumulation of new knowledge (Parisi et al. 2019). It is interesting to note that data streams are de facto a task-free continual learning scenario (Aljundi et al. 2019).

The vast majority of data stream mining algorithms are designed only for a vector representation of input data. This representation is not adequate for many real-world problems that generate multi-dimensional data with dependencies between dimensions, such as computer vision (Yang et al. 2018) or social networks (Nakatsuji et al. 2017). Although one may easily vectorize such data, doing so leads to a loss of information, as relationships between factors in the input space are not preserved (Gu et al. 2018). In order to overcome this limitation, a tensor representation has been proposed, where input data is stored as multi-dimensional cubes that preserve the dependencies between factors (Lathauwer 2009; Fu et al. 2015). Tensors have gained popularity in various areas of machine learning and data mining (Sidiropoulos et al. 2017; Maruhashi et al. 2018), but their application to data streams is very limited. At the same time, many modern data sources generate multi-dimensional data streams (Mardani et al. 2015) and these areas may definitely benefit from dedicated tensor-based streaming algorithms (Shin et al. 2017; Song et al. 2017). Most existing works focus on tensor factorization (Smith et al. 2018), using stochastic descent approaches (Mardani et al. 2015), the Tucker model (Sun et al. 2008), and online canonical polyadic decomposition (Smith et al. 2018). At the same time, to the best of our knowledge, tensor classification has not been studied in the data stream setup, especially in the context of concept drift. This paper aims at bridging this gap by proposing an efficient framework for data stream classification with tensor input.

1.1 Goal

To propose a novel continual decision tree induction technique that allows for learning from drifting data streams using tensor-based data representation.

1.2 Motivation

Among classifiers dedicated to data streams, decision trees have gained significant attention, due to their excellent capabilities for incremental learning by creating new splits as instances arrive, high classification accuracy, and low model complexity (Ditzler et al. 2015). However, existing decision trees for data streams work only with a vector representation. This limits their applicability to modern data sources, such as texts or images, where vectorization leads to a significant loss of information. It is therefore beneficial to use a tensor-based representation that maintains all the properties of such complex data. However, current techniques dedicated to tensor classification are not suitable for data streaming scenarios, nor do they possess any mechanisms for handling the presence of concept drift. The same holds for modern deep learning architectures that, while being extremely effective for static tensor data, cannot handle the velocity and rapidly evolving nature of data streams (Sahoo et al. 2018).

1.3 Overview

In this paper, we propose a novel framework for classifying data streams with tensor representation. We introduce a decision tree learning scheme capable of handling tensors directly, without the need for vectorization. At the same time, our proposal maintains all the advantages of decision trees. We achieve this by training classifiers in a similarity space defined by a kernel over the tensor representation. The chordal distance allows us to measure the similarity between two tensors and may be used to construct a kernel feature space, which in turn allows for induction of a decision tree directly from tensors. Additionally, we propose a concept drift detection scheme working with the tensor representation. It allows us to effectively detect the moment of change and update our model in two ways: (1) by reconstructing the kernel feature space using new instances; and (2) by retraining the decision tree on the current concept and new feature space. The experimental study shows the efficacy of the proposed approach and its wide usability in various data stream classification scenarios where a tensor representation is required.

1.4 Main contributions

This paper offers the following insights into learning from drifting data streams with complex data:

  • Chordal Kernel Decision Tree We propose a novel decision tree classifier (CKDT) for continual learning from drifting data streams with data arriving in tensor form. CKDT is a fully adaptive classifier, capable of both continual accumulation of new knowledge from arriving tensors and flexible adaptation to drifts in the stream, when previously learned concepts become outdated. CKDT uses McDiarmid’s inequality to control the continual splitting procedure from streaming tensor data.

  • Adaptive tensor kernel similarity space We introduce a kernel similarity space for continual induction of decision trees from tensor data streams. A subsampled kernel is used to create a new tensor-based representation that allows for continual learning from tensor data streams. We present a mechanism for rebuilding the kernel space whenever concept drift occurs, allowing for adaptive feature crafting from evolving data.

  • First concept drift detector for tensors We propose a simple, yet effective tool for monitoring properties of tensors arriving from the data stream. This allows for early detection of changes in tensor properties and cost-efficient adaptation of CKDT whenever the stream becomes subject to significant changes. The proposed drift detector works directly on the tensor representation of data.

  • Experimental study We provide a detailed experimental benchmark on drifting tensor data streams, comparing the proposed approach with three state-of-the-art methods for incremental tensor classification. We use 4 real-world and 52 artificial tensor data stream benchmarks that capture various domains and learning difficulties.

2 Background

2.1 Learning from data streams

We will now provide a short background on the data stream setting in the context of machine learning.

Definition 1

(Data stream) A data stream is a sequence \(<S_1, S_2,\ldots , S_n,\ldots>\), where each element \(S_n\) is a new instance arriving over time. Each instance in the stream is independent and randomly drawn from a stationary probability distribution \(\Psi _n ({\mathbf {x}},y)\). A data stream is a task-free continual learning problem (Aljundi et al. 2019).

Definition 2

(Concept drift) Concept drift is a phenomenon that influences estimated decision rules or classification boundaries, reducing or voiding their relevance to the new state of the stream. Real concept drift influences the conditional probabilities \(p_j(y|{\mathbf {x}})\) over time.

Concept drift has a crucial impact on the learning system and must be handled as soon as it occurs (Gama et al. 2014). There are three main approaches for handling this learning difficulty. The first one relies on an external tool, known as a concept drift detector. It monitors the properties of the stream and signals when a significant change takes place, so that the model can be rebuilt. This solution is often combined with decision trees. The second one uses a sliding window that keeps the most recent instances in a temporal memory, using them as the current representation of the stream. The third one relies on online classifiers and ensemble models (Krawczyk et al. 2017) that adapt to changes on their own, resulting in an implicit drift detection.

2.2 Tensors in machine learning and classification

We will now define the basic notation for the representation and classification of data coming in the form of tensors.

Definition 3

(Tensor) A tensor is an L-dimensional cube of real-valued data, where each individual dimension represents a different factor in the input data space:

$$\begin{aligned} {{\mathcal {A}}} \in {\mathfrak {R}^{{N_1} \times {N_2} \times \ldots {N_L}}} \end{aligned}$$
(1)

The j-mode vector of an L-th order tensor (the tensor order standing for its number of directions/dimensionality) is a vector calculated from \({\mathcal {A}}\) by varying the j-th index \(k \in \{1,2,\cdots ,N_{j}\}\), while keeping the remaining indices fixed.

Definition 4

(Tensor flattening) A j-mode tensor flattening (also known as tensor matricization) is a matrix A\(_{(\textit{j})}\) whose columns are the j-mode vectors of \({\mathcal {A}}\):

$$\begin{aligned} {{\mathbf{A}}_{\left( j \right) }} \in {\mathfrak {R}^{{N_j} \times \left( {{N_1}{N_2} \ldots {N_{j - 1}}{N_{j + 1}} \ldots {N_L}} \right) }} \end{aligned}$$
(2)

The j-th index serves as the row index of A\(_{(\textit{j})}\), while the product of all remaining L-1 indices serves as its column index.
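To make the flattening operation concrete, the following minimal NumPy sketch unfolds a small third-order tensor along a chosen mode. The helper name `mode_flatten` and the column ordering (NumPy's default C order, which may differ from other matricization conventions in the literature) are our illustrative assumptions.

```python
import numpy as np

def mode_flatten(tensor, j):
    """j-mode flattening (matricization): rows are indexed by the j-th mode,
    columns by the product of all remaining modes (Eq. 2)."""
    return np.moveaxis(tensor, j, 0).reshape(tensor.shape[j], -1)

# A small 3rd-order tensor of size N1 x N2 x N3 = 2 x 3 x 4.
A = np.arange(24, dtype=float).reshape(2, 3, 4)
print(mode_flatten(A, 0).shape)  # (2, 12)
print(mode_flatten(A, 1).shape)  # (3, 8)
print(mode_flatten(A, 2).shape)  # (4, 6)
```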

Definition 5

(Tensor product) A p-mode product of a tensor \({{\mathcal {A}}} \in {\mathfrak {R}^{{N_1} \times {N_2} \times \ldots {N_L}}}\) with matrix \({\mathbf{M}} \in {\mathfrak {R}^{Q \times {N_p}}}\) creates a tensor \({{\mathcal {B}}} \in {\mathfrak {R}^{{N_1} \times {N_2} \times \ldots {N_{p - 1}} \times Q \times {N_{p + 1}} \times \ldots {N_L}}}\) with elements:

$$\begin{aligned} {{{\mathcal {B}}}_{{n_1}{n_2} \ldots {n_{p - 1}}q{n_{p + 1}} \ldots {n_L}}}= & {} {\left( {{{\mathcal {A}}}{ \times _p}{\mathbf{M}}} \right) _{{n_1}{n_2} \ldots {n_{p - 1}}q{n_{p + 1}} \ldots {n_L}}} \nonumber \\= & {} \sum \limits _{{n_p} = 1}^{{N_p}} {{a_{{n_1}{n_2} \ldots {n_{p - 1}}{n_p}{n_{p + 1}} \ldots {n_L}}}{m_{q{n_p}}}.} \end{aligned}$$
(3)

where \(a_{n_1n_2...n_L}\) is an element of \({\mathcal {A}}\) at index \((n_1,n_2,...,n_L)\) and analogously \(m_{qn_p}\) is an element of \(\mathbf{M}\) at index \((q,n_p)\).

The p-mode product can be equivalently expressed through the flattened tensors A\(_{(\textit{p})}\) and B\(_{(\textit{p})}\). Assuming the following holds:

$$\begin{aligned} {{\mathcal {B}}} = {{\mathcal {A}}}{ \times _p}{\mathbf{M}} \end{aligned}$$
(4)

then

$$\begin{aligned} {{\mathbf{B}}_{\left( p \right) }} = {\mathbf{M}}\,{{\mathbf{A}}_{\left( p \right) }} \end{aligned}$$
(5)

Each distinct tensor flattening creates a unique matrix with specific properties. Therefore, by analyzing each flattening A\(_{(\textit{j})}\) we obtain a unique perspective on \({{\mathcal {A}}}\) from the j-th dimension. We will use this property to construct a tensor-based kernel for data stream representation, which is discussed in detail in the next subsection.
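The equivalence between Eqs. 4 and 5 can be illustrated with a short sketch that computes the p-mode product by flattening, multiplying by M, and folding the result back into tensor form; the helper names and toy sizes are our assumptions.

```python
import numpy as np

def mode_flatten(tensor, p):
    return np.moveaxis(tensor, p, 0).reshape(tensor.shape[p], -1)

def mode_product(tensor, M, p):
    """p-mode product A x_p M computed through its flattened form B_(p) = M A_(p) (Eq. 5)."""
    rest = [tensor.shape[i] for i in range(tensor.ndim) if i != p]
    B_p = M @ mode_flatten(tensor, p)                   # Q x (product of remaining modes)
    return np.moveaxis(B_p.reshape([M.shape[0]] + rest), 0, p)

A = np.random.rand(2, 3, 4)
M = np.random.rand(5, 3)            # Q x N_p with p = 1 (the second mode)
B = mode_product(A, M, p=1)         # resulting shape (2, 5, 4)

# The element-wise definition of Eq. 3 gives the same result.
assert np.allclose(B, np.einsum('abc,qb->aqc', A, M))
```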

Definition 6

(Singular Value Decomposition) SVD is a procedure for analyzing the properties of each flattening as follows:

$$\begin{aligned} {{\mathbf{A}}_{\left( j \right) }}= & {} {{\mathbf{S}}^{\left( j \right) }}{{\mathbf{V}}^{\left( j \right) }}{{\mathbf{D}}^{\left( j \right) ^T}} = \sum \limits _{i = 1}^{{R_{{A_{(j)}}}}} {v_i^{\left( j \right) }{\mathbf{s}}_i^{\left( j \right) }{\mathbf{d}}_i^{\left( j \right) ^T}} \nonumber \\= & {} \left[ {\begin{array}{*{20}{c}} {{\mathbf{S}}_{{\mathbf{A}},1}^{\left( j \right) }}&{{\mathbf{S}}_{{\mathbf{A}},2}^{\left( j \right) }} \end{array}} \right] \left[ {\begin{array}{*{20}{c}} {{\mathbf{V}}_{{\mathbf{A}},1}^{\left( j \right) }}&{}{\mathbf{0}}\\ {\mathbf{0}}&{}{\mathbf{0}} \end{array}} \right] \left[ {\begin{array}{*{20}{c}} {{\mathbf{D}}_{{\mathbf{A}},1}^{\left( j \right) ^T}}\\ {{\mathbf{D}}_{{\mathbf{A}},2}^{\left( j \right) ^T}} \end{array}} \right] . \end{aligned}$$
(6)

where the subscripts \(_{\mathbf{A},1}\) and \(_{\mathbf{A},2}\) denote the block matrices related to the kernel and null spaces of \({{\mathbf{A}}_{\left( j \right) }}\), respectively. \({\mathbf{S}}_{{\mathbf{A}},1}^{\left( j \right) }\) and \({\mathbf{D}}_{{\mathbf{A}},1}^{T\left( j \right) }\) are unitary matrices of the kernel of \({{\mathbf{A}}_{\left( j \right) }}\). \({\mathbf{V}}_{{\mathbf{A}},1}^{\left( j \right) }\) is a diagonal matrix with \(R_{\mathbf{A}}\) non-zero elements.

By assuming this definition of SVD, it follows that:

$$\begin{aligned} {\mathbf{D}}_{{\mathbf{A}},1}^{\left( j \right) ^T}{\mathbf{D}}_{{\mathbf{A}},1}^{\left( j \right) } = {{\mathbf{I}}_{{R_{\mathbf{A}}} \times {R_{\mathbf{A}}}}} \end{aligned}$$
(7)

Analogous properties are preserved for the j-th mode flattening of the tensor \({{\mathcal {B}}}\). However, its rank may be different, and thus we will denote it as \(R_{\mathbf{B}}\).
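As a small illustration of Definition 6 and the property in Eq. 7, the sketch below extracts the orthonormal basis \({\mathbf{D}}_{{\mathbf{A}},1}^{(j)}\) and the rank \(R_{\mathbf{A}}\) from the SVD of a flattening. The helper name, the rank tolerance, and the reliance on NumPy's `svd` (which returns the right singular vectors stored row-wise) are our assumptions.

```python
import numpy as np

def flattening_basis(A_j, tol=1e-10):
    """SVD of a flattening A_(j) = S V D^T (Eq. 6); returns D_{A,1} (basis of the
    non-null part) and the rank R_A."""
    _, v, Dt = np.linalg.svd(A_j, full_matrices=False)
    R_A = int(np.sum(v > tol * v[0]))     # number of non-zero singular values
    D_A1 = Dt[:R_A].T                     # columns are the leading right singular vectors
    # Property of Eq. 7: D_{A,1}^T D_{A,1} = I_{R_A x R_A}
    assert np.allclose(D_A1.T @ D_A1, np.eye(R_A))
    return D_A1, R_A

A_j = np.random.rand(4, 12)               # e.g., a mode flattening of a small tensor
D_A1, R_A = flattening_basis(A_j)
print(D_A1.shape, R_A)                     # (12, 4) and 4 for a full-rank flattening
```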

In this work, we focus on the task of tensor classification, i.e., assigning a class label to an input tensor (Li and Schonfeld 2014).

Definition 7

(Tensor classification) This task aims at creating a classifier defined as a function \(\Psi \) with domain \({{\mathcal {A}}}\) and codomain \(\mathcal {M}\):

$$\begin{aligned} \Psi : {{\mathcal {A}}} \rightarrow \mathcal {M}, \end{aligned}$$
(8)

where \(\mathcal {M} = \{1, \cdots , M\}\) stands for a set of class labels.

2.3 Related works for streaming tensor analysis

Streaming tensors have been considered in the literature mainly from the perspective of a single tensor that evolves over time (Yang et al. 2021). This may include changes in existing dimensions/factors (Chhaya et al. 2020) or the emergence of new ones over time (Letourneau et al. 2018). CP decomposition has been successfully used for streaming tensors, either based on simultaneous diagonalization or on weighted least squares that track the online third-order decomposition (Rambhatla et al. 2020). Other approaches use a grid division for large streaming tensors and apply local factorization independently to each sub-tensor (Gujral et al. 2020). OnlineCP (Zhou et al. 2016) incrementally tracks the CP decomposition of a streaming tensor with arbitrary modes. There also exist Tucker decomposition methods for online tensor analysis that can be effectively used under streaming conditions (Sun et al. 2020). From the data stream mining perspective, there exists a plethora of effective classification and drift detection methods, but all of them are dedicated to shallow vector representations and therefore cannot properly capture multi-dimensional relationships in tensor data (Pinage et al. 2020; Zyblewski et al. 2021).

3 Decision tree learning for tensor data streams

3.1 Decision trees in the era of deep learning

Deep learning has dominated the world of learning from complex and high-dimensional data, offering unparalleled predictive and generative capabilities. However, research on traditional (shallow) machine learning algorithms is still as vibrant as ever, due to a number of limitations of current deep learning architectures in specific learning scenarios. This is especially visible for learning from data streams, where existing deep architectures have difficulties with handling the presence of concept drift (Sahoo et al. 2018), or their adaptation mechanisms, while well-designed, are too slow for high-speed data streams (Ashfahani and Pratama 2019). Decision trees are well-known and attractive learning algorithms for data streams, offering low computational cost with excellent adaptation capabilities to concept drift (Gomes et al. 2019). Furthermore, they are explainable and interpretable models, offering a white-box approach for streaming data. While their predictive power is weak on their own, they can be efficiently combined in ensemble architectures, leading to a significant increase in their accuracy (Krawczyk et al. 2017). All these factors motivate us to develop a novel decision tree model that is capable of learning from tensor data streams under concept drift.

3.2 Proposed algorithm overview

We present the details of the Chordal Kernel Decision Tree (CKDT) for continual learning from tensor data streams. We discuss the decision tree model used for unbounded data streams, the kernel feature space designed for working with the tensor representation, and concept drift detection from tensor data. An overview of the proposed CKDT algorithm is presented in pseudo-code form in Fig. 1.

Fig. 1 Pseudocode of the proposed Chordal Kernel Decision Tree

3.3 Decision tree induction from streaming data

Decision tree induction algorithms for data streams are based on Hoeffding’s inequality in order to determine how many new instances are sufficient to conduct a new split. A recent study highlighted flaws in the Hoeffding bound (Rutkowski et al. 2013), showing that it may lead to incorrect calculations. Therefore, in this work we use McDiarmid’s inequality for decision tree induction from streaming data. It can be seen as a generalized version of Hoeffding’s inequality, more capable of handling various types of input data and measuring the split quality.

Theorem 1

(McDiarmid’s Theorem) Let \(X_1,\cdots ,X_n\) be a set of independent random variables and let \(f(x_1,\cdots ,x_n)\) be a function that fulfills the inequality:

$$\begin{aligned} \begin{array}{l} \sup _{x_1,\cdots ,x_n,\hat{x}_i} | f(x_1,\cdots ,x_i,\cdots ,x_n) - f(x_1,\cdots ,\hat{x}_i,\cdots ,x_n)| \\ \le c_i, \quad \forall _{i = 1,\cdots ,n} . \end{array} \end{aligned}$$
(9)

For any \(\epsilon > 0\) the following is true:

$$\begin{aligned} \begin{array}{l} Pr \left( f(X_1,\cdots ,X_n) - E\left[ f(X_1,\cdots ,X_n) \right] \ge \epsilon \right) \\ \le \exp \left( - \frac{2\epsilon ^2}{\sum _{i = 1}^{n} c_i^2} \right) = \delta . \end{array} \end{aligned}$$
(10)

McDiarmid’s inequality can be used in combination with any splitting measure to estimate the lowest number of instances n sufficient to conduct a split when new data arrives. It has been shown to work well with Gini gain (Rutkowski et al. 2013), thus we use this metric. Gini gain is defined as:

$$\begin{aligned} \varDelta g_i^G(S) = g^G(S) - \sum _{q \in \{L,R\}} \frac{n_{q,i}(S)}{n(S)} \left( 1 - \sum _k^K \left( \frac{n_{q,i}^k(S)}{n_{q,i}(S)}\right) ^2\right) , \end{aligned}$$
(11)

where S is the set of instances in the current tree node, L and R are the left and right child nodes, \(n_{q,i}(S)\) is the number of instances in the current node that will go to the q-th child node if the split is conducted on the i-th feature, and \(n^k_{q,i}(S)\) is the number of instances from the k-th class that will be passed to the q-th child node if the split is conducted on the i-th feature. With this, we may formulate McDiarmid’s inequality for computing and comparing Gini gains for any two selected features.
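Before stating the split criterion, a minimal sketch of the Gini gain of Eq. 11 for a simple threshold split is given below; the helper names and the toy data are our illustrative assumptions.

```python
import numpy as np

def gini(labels, n_classes):
    """Gini impurity 1 - sum_k p_k^2 of a set of class labels."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=n_classes) / len(labels)
    return 1.0 - np.sum(p ** 2)

def gini_gain(feature_values, labels, threshold, n_classes):
    """Gini gain (Eq. 11) for splitting on `feature_values <= threshold`."""
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    n = len(labels)
    weighted_child = (len(left) / n) * gini(left, n_classes) + \
                     (len(right) / n) * gini(right, n_classes)
    return gini(labels, n_classes) - weighted_child

# Toy example: one feature, two classes.
x = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
y = np.array([0, 0, 0, 1, 1, 1])
print(gini_gain(x, y, threshold=0.5, n_classes=2))   # 0.5: a perfect split
```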

Theorem 2

(McDiarmid’s Inequality for Gini Gain) Let \(\varDelta g_h^G(S)\) and \(\varDelta g_i^G(S)\) be the Gini gain values (see Eq. 11) for the h-th and i-th considered features. If the following condition is satisfied:

$$\begin{aligned} \varDelta g_h^G(S) - \varDelta g_i^G(S) > \sqrt{\frac{8 \ln (1/\delta )}{n(S)}}, \end{aligned}$$
(12)

then, with probability \(1 - \delta \) or higher, the following inequality holds:

$$\begin{aligned} E[\varDelta g_h^G(S)] > E [\varDelta g_i^G(S)]. \end{aligned}$$
(13)

Theorem 3

(McDiarmid’s Split Criterion for Gini Gain) We assume that \(\varDelta g_{i_1}^G(S)\) and \(\varDelta g_{i_2}^G(S)\) are the metric values for the features with the highest and second highest Gini gain, respectively. If the following condition is satisfied:

$$\begin{aligned} \varDelta g_{i_1}^G(S) - \varDelta g_{i_2}^G(S) > \sqrt{\frac{8 \ln (1/\delta )}{n(S)}}, \end{aligned}$$
(14)

then, following Theorem 2, with probability at least \((1 - \delta )^{d-1}\) the following statement is true:

$$\begin{aligned} i_{1} = \arg \max _{i = 1,\cdots ,d} \left\{ E[\varDelta g_{i}^G(S)] \right\} , \end{aligned}$$
(15)

where d is the number of features and the \(i_{1}\)-th feature is selected to split the current node.
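A minimal sketch of the resulting split decision is given below; the function names, the default \(\delta \), and the toy gain values are our assumptions.

```python
import numpy as np

def mcdiarmid_bound(n_instances, delta):
    """Right-hand side of Eqs. (12) and (14)."""
    return np.sqrt(8.0 * np.log(1.0 / delta) / n_instances)

def should_split(gains, n_instances, delta=1e-5):
    """Theorem 3: split on the best feature if its Gini gain exceeds the
    second best one by more than the McDiarmid bound."""
    order = np.argsort(gains)[::-1]
    best, second = order[0], order[1]
    if gains[best] - gains[second] > mcdiarmid_bound(n_instances, delta):
        return best        # index of the feature chosen for the split
    return None            # not enough evidence yet; wait for more instances

gains = np.array([0.42, 0.18, 0.05])
print(should_split(gains, n_instances=2000))   # -> 0 (first feature wins)
```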

3.4 Induction of decision trees in kernel feature space

Existing decision tree induction algorithms, including the one presented above using McDiarmid’s inequality, work only with vector inputs. Therefore, it is not possible to apply them directly to tensor data without conducting vectorization. In order to alleviate this drawback and extend the applicability of decision trees to tensor data streams, we propose to conduct the tree induction procedure in an alternative feature space. We need a simple, yet efficient representation of tensors that will maintain their multi-dimensional properties and the relationships among different factors. In this paper, we propose to construct the new feature space using kernels.

A kernel \(\mathcal {K}\) can be used to transform the original feature space into a projected space \(\varphi _\mathcal {K}(\mathcal {X})\) such that \(\mathcal {K}(x,y) = \langle \varphi _\mathcal {K}(x), \varphi _\mathcal {K}(y)\rangle \). Kernels are tricky to use in data stream scenarios, as they require the computation of the whole Gram matrix, which is of size \(\mathcal {O}(N^2)\). In order to speed up the computations, one may use random sampling of the input instances to create a new projected feature space. By sampling s instances from the stream, one may create a subsampled kernel:

$$\begin{aligned} \varphi ^\mathrm {rand}_\mathcal {K}(x) = [\mathcal {K}(x, x^1), ..., \mathcal {K}(x, x^s)]^T. \end{aligned}$$
(16)

One must note that this is fundamentally different from sampling the incoming instances from the stream, as all of them will be used for decision tree induction and incremental updates; the subsample is only used for a faster computation of the new feature space. This allows for a significant reduction in the computational complexity of the feature space projection.
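A minimal sketch of the subsampled kernel projection of Eq. 16 is given below. The kernel is left as a generic placeholder (a plain RBF on vectors); in the full framework the chordal tensor kernel introduced in the next subsection is plugged in instead, and all names are our assumptions.

```python
import numpy as np

def subsampled_kernel_space(kernel, landmarks):
    """Build the projection of Eq. (16): phi(x) = [K(x, x^1), ..., K(x, x^s)]^T,
    where the s landmark instances were randomly sampled from the stream."""
    def project(x):
        return np.array([kernel(x, z) for z in landmarks])
    return project

# Toy usage with a placeholder RBF kernel on vectors.
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)
stream_sample = [np.random.rand(4) for _ in range(3)]   # s = 3 landmarks
project = subsampled_kernel_space(rbf, stream_sample)
print(project(np.random.rand(4)))                        # 3-dimensional feature vector
```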

We now require a proper kernel that is capable of working directly with the tensor representation. For this, we propose to use the chordal distance kernel, capable of returning purely tensor-based similarities that allow us to span a new feature space for decision tree induction.

3.5 Chordal distance kernel for tensor similarity

Definition 8

(Chordal distance) The chordal distance (Signoretto et al. 2011) is defined as a distance between two tensors represented by their j-th mode flattening matrices \({{\mathbf{A}}_{\left( j \right) }}\) and \({{\mathbf{B}}_{\left( j \right) }}\):

$$\begin{aligned} D_{ch}^2\left( {{{\mathbf{A}}_{\left( j \right) }},{{\mathbf{B}}_{\left( j \right) }}} \right) = D_F^2\left( {{\Pi _{{{\mathbf{A}}_{\left( j \right) }}}},{\Pi _{{{\mathbf{B}}_{\left( j \right) }}}}} \right) = \left\| {{\Pi _{{{\mathbf{A}}_{\left( j \right) }}}} - {\Pi _{{{\mathbf{B}}_{\left( j \right) }}}}} \right\| _F^2 \end{aligned}$$
(17)

where \({\Pi _{{{\mathbf{A}}_{\left( j \right) }}}}\) stands for a projection matrix of \({{\mathbf{A}}_{\left( j \right) }}\):

$$\begin{aligned} {\Pi _{{{\mathbf{A}}_{\left( j \right) }}}} = {\mathbf{D}}_{{\mathbf{A}},1}^{\left( j \right) }{\mathbf{D}}_{{\mathbf{A}},1}^{T\left( j \right) } \end{aligned}$$
(18)

Inserting Eq. 18 into Eq. 17, one obtains:

$$\begin{aligned} D_{ch}^2\left( {{{\mathbf{A}}_{\left( j \right) }},{{\mathbf{B}}_{\left( j \right) }}} \right) = \left\| {{\mathbf{D}}_{{\mathbf{A}},1}^{\left( j \right) }{\mathbf{D}}_{{\mathbf{A}},1}^{T\left( j \right) } - {\mathbf{D}}_{{\mathbf{B}},1}^{\left( j \right) }{\mathbf{D}}_{{\mathbf{B}},1}^{T\left( j \right) }} \right\| _F^2 \end{aligned}$$
(19)

Definition 9

(Chordal kernel) A chordal distance-based kernel (Signoretto et al. 2011) can be formulated as follows:

$$\begin{aligned} {K_j}\left( {{{\mathcal {A}}},{{\mathcal {B}}}} \right)= & {} \exp \left( { - \frac{1}{{2{\sigma ^2}}}D_{ch}^2\left( {{{\mathbf{A}}_{\left( j \right) }},{{\mathbf{B}}_{\left( j \right) }}} \right) } \right) \nonumber \\= & {} \exp \left( { - \frac{1}{{2{\sigma ^2}}}\left\| {{\mathbf{D}}_{{\mathbf{A}},1}^{\left( j \right) }{\mathbf{D}}_{{\mathbf{A}},1}^{T\left( j \right) } - {\mathbf{D}}_{{\mathbf{B}},1}^{\left( j \right) }{\mathbf{D}}_{{\mathbf{B}},1}^{T\left( j \right) }} \right\| _F^2} \right) . \end{aligned}$$
(20)

which allows us to formulate a kernel for L-dimensional tensors as a product over all modes (Cyganek et al. 2015):

$$\begin{aligned} K\left( {{{\mathcal {A}}},{{\mathcal {B}}}} \right)= & {} \prod \limits _{j = 1}^L {{K_j}\left( {\mathcal{A},{{\mathcal {B}}}} \right) }\nonumber \\= & {} \prod \limits _{j = 1}^L {\exp \left( { - \frac{1}{{2{\sigma ^2}}}\left\| {{\mathbf{D}}_{{\mathbf{A}},1}^{\left( j \right) }{\mathbf{D}}_{{\mathbf{A}},1}^{T\left( j \right) } - {\mathbf{D}}_{{\mathbf{B}},1}^{\left( j \right) }{\mathbf{D}}_{{\mathbf{B}},1}^{T\left( j \right) }} \right\| _F^2} \right) } \end{aligned}$$
(21)
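A direct, unoptimized sketch of the chordal kernel of Eqs. 19–21 is given below, assuming the flattening and SVD-basis helpers from the previous definitions; the function names and the bandwidth value are our assumptions.

```python
import numpy as np

def mode_flatten(tensor, j):
    return np.moveaxis(tensor, j, 0).reshape(tensor.shape[j], -1)

def row_space_basis(M, tol=1e-10):
    _, v, Dt = np.linalg.svd(M, full_matrices=False)
    rank = int(np.sum(v > tol * v[0]))
    return Dt[:rank].T            # D_{.,1}: orthonormal basis of the non-null part

def chordal_kernel(A, B, sigma=1.0):
    """Chordal kernel of Eq. (21): product over all L modes of the per-mode
    Gaussian kernels built on the squared chordal distance (Eq. 19)."""
    k = 1.0
    for j in range(A.ndim):
        D_A = row_space_basis(mode_flatten(A, j))
        D_B = row_space_basis(mode_flatten(B, j))
        d2 = np.linalg.norm(D_A @ D_A.T - D_B @ D_B.T, 'fro') ** 2
        k *= np.exp(-d2 / (2.0 * sigma ** 2))
    return k

A = np.random.rand(4, 5, 6)
B = np.random.rand(4, 5, 6)
print(chordal_kernel(A, B))
```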

Computation of Eq. 21 requires 2L SVD decompositions. This makes it prohibitively expensive in the considered scenario of learning from data streams, as new tensors arrive continuously and latency in their processing must be avoided. However, this computation may be simplified as follows. We start by expanding the squared norm in Eq. 19 as:

$$\begin{aligned} {\left\| {{\mathbf{P}} - {\mathbf{Q}}} \right\| ^2} = Tr\left( {{{\mathbf{P}}^T}{\mathbf{P}}} \right) - 2Tr\left( {{{\mathbf{P}}^T}{\mathbf{Q}}} \right) + Tr\left( {{{\mathbf{Q}}^T}{\mathbf{Q}}} \right) \end{aligned}$$
(22)

where \( Tr(.)\) stands for the matrix trace, and P and Q are defined as:

$$\begin{aligned} {\mathbf{P}} = {\mathbf{D}}_{{\mathbf{A}},1}^{\left( j \right) }{\mathbf{D}}_{{\mathbf{A}},1}^{T\left( j \right) }, \quad {\mathbf{Q}} = {\mathbf{D}}_{{\mathbf{B}},1}^{\left( j \right) }{\mathbf{D}}_{{\mathbf{B}},1}^{T\left( j \right) } \end{aligned}$$
(23)

Matrices P and Q are of the same size. We may apply this to the first term in Eq. 22:

$$\begin{aligned} Tr\left( {{{\mathbf{P}}^T}{\mathbf{P}}} \right)= & {} Tr\left( {{{\left( {{{\mathbf{D}}_{\mathbf{A}}}{\mathbf{D}}_{\mathbf{A}}^T} \right) }^T}\left( {{{\mathbf{D}}_{\mathbf{A}}}{\mathbf{D}}_{\mathbf{A}}^T} \right) } \right) \nonumber \\= & {} Tr\left( {{{\mathbf{D}}_{\mathbf{A}}}\underbrace{{\mathbf{D}}_{\mathbf{A}}^T{{\mathbf{D}}_{\mathbf{A}}}}_{\mathbf{I}}{\mathbf{D}}_{\mathbf{A}}^T} \right) = Tr\left( {{{\mathbf{D}}_{\mathbf{A}}}{\mathbf{D}}_{\mathbf{A}}^T} \right) \nonumber \\= & {} Tr\left( {{\mathbf{D}}_{\mathbf{A}}^T{{\mathbf{D}}_{\mathbf{A}}}} \right) = {R_{\mathbf{A}}}. \end{aligned}$$
(24)

Analogously, this holds for the third term in Eq. 22:

$$\begin{aligned} Tr\left( {{{\mathbf{Q}}^T}{\mathbf{Q}}} \right) = {R_{\mathbf{B}}} \end{aligned}$$
(25)

The second term in Eq. 22 can be expanded accordingly:

$$\begin{aligned} Tr\left( {{{\mathbf{P}}^T}{\mathbf{Q}}} \right)= & {} Tr\left( {{{\left( {{{\mathbf{D}}_{\mathbf{A}}}{\mathbf{D}}_{\mathbf{A}}^T} \right) }^T}{{\mathbf{D}}_{\mathbf{B}}}{\mathbf{D}}_{\mathbf{B}}^T} \right) \nonumber \\= & {} Tr\left( {{{\mathbf{D}}_{\mathbf{A}}}{\mathbf{D}}_{\mathbf{A}}^T{{\mathbf{D}}_{\mathbf{B}}}{\mathbf{D}}_{\mathbf{B}}^T} \right) = Tr\left( {{\mathbf{D}}_{\mathbf{A}}^T{{\mathbf{D}}_{\mathbf{B}}}{\mathbf{D}}_{\mathbf{B}}^T{{\mathbf{D}}_{\mathbf{A}}}} \right) \nonumber \\= & {} Tr\left( {{{\left( {\underbrace{{\mathbf{D}}_{\mathbf{B}}^T{{\mathbf{D}}_{\mathbf{A}}}}_{\mathbf{G}}} \right) }^T}\underbrace{{\mathbf{D}}_{\mathbf{B}}^T{{\mathbf{D}}_{\mathbf{A}}}}_{\mathbf{G}}} \right) = Tr\left( {{{\mathbf{G}}^T}{\mathbf{G}}} \right) . \end{aligned}$$
(26)

These transformations of the three terms allow us to write Eq. 22 as:

$$\begin{aligned} D_{ch}^2\left( {{{\mathbf{A}}_{\left( j \right) }},{{\mathbf{B}}_{\left( j \right) }}} \right) = {R_{\mathbf{A}}} + {R_{\mathbf{B}}} - 2\,Tr\left( {{\mathbf{G}}_{\left( j \right) }^T{{\mathbf{G}}_{\left( j \right) }}} \right) \end{aligned}$$
(27)

where

$$\begin{aligned} {{\mathbf{G}}_{\left( j \right) }} = {\mathbf{D}}_{{\mathbf{B}},1}^{T\left( j \right) }{\mathbf{D}}_{{\mathbf{A}},1}^{\left( j \right) } \end{aligned}$$
(28)

This allows us to significantly speed up computations of the chordal distance and the related kernel, as Eqs. 27 and 28 have a significantly lower computational complexity than Eq. 22. This is because only \({{\mathbf{G}}_{\left( j \right) }}\) needs to be computed after carrying out the SVD decompositions of the j-th mode flattenings of tensors \({{\mathcal {A}}}\) and \({{\mathcal {B}}}\), respectively. In order to obtain the chordal kernel, we repeat these computations L times (once per tensor mode).

This, combined with the input subsampling (subsampled kernel) presented in Eq. 16, makes the computational cost of tensor-based feature space spanning via kernel projections suitable for data stream scenarios.
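A sketch of the simplified computation of Eqs. 27–28 is given below, together with a sanity check against the direct form of Eq. 19; the helper name is our assumption.

```python
import numpy as np

def chordal_distance_sq_fast(D_A1, D_B1):
    """Eqs. (27)-(28): D_ch^2 = R_A + R_B - 2 Tr(G^T G) with G = D_{B,1}^T D_{A,1},
    avoiding the explicit (and much larger) projection matrices of Eq. (19)."""
    G = D_B1.T @ D_A1
    return D_A1.shape[1] + D_B1.shape[1] - 2.0 * np.sum(G * G)   # Tr(G^T G) = ||G||_F^2

# Sanity check against the direct form of Eq. (19) on random orthonormal bases.
D_A1 = np.linalg.qr(np.random.rand(20, 4))[0]
D_B1 = np.linalg.qr(np.random.rand(20, 3))[0]
direct = np.linalg.norm(D_A1 @ D_A1.T - D_B1 @ D_B1.T, 'fro') ** 2
assert np.isclose(direct, chordal_distance_sq_fast(D_A1, D_B1))
```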

4 Concept drift detection in tensor data streams

Having defined the decision tree induction method, the creation of a tensor-based feature space for it, and a proper kernel for computing similarities between tensors, we need a concept drift detector fitting this framework. Most existing drift detectors are based either on statistical properties of new vectors arriving from the stream or on the error of classifiers (Gama et al. 2014). The former cannot be directly used for tensor data, while the latter are criticized because in real-world scenarios we do not have instant oracle access to the classifier error. Therefore, we propose a concept drift detection method based on tensor properties.

For drift detection, we keep a window of the w most recent tensors and use them for comparison with the newest incoming tensor, to check whether it still falls into the current concept. We propose to conduct drift detection using the j-mode tensor flattening (see Eq. 2), computing it for all tensors in the window and for the newly arrived one. The mean (\(\overline{{\mathbf{A}}}_{\left( j \right) } = \frac{1}{w} \sum _{i=1}^w {{\mathbf{A}}_{\left( j \right) }}^{(i)}\)) and the standard deviation \(\sigma \), computed from the variance \(\sigma ^2 = \frac{1}{w-1} \sum _{i=1}^w ({{\mathbf{A}}_{\left( j \right) }}^{(i)} - \overline{{\mathbf{A}}}_{\left( j \right) })^2\), of these flattening matrices can be used to describe the current concept. When a new tensor arrives, its j-mode tensor flattening can be used to check how well it fits the current concept using the popular \(3 \sigma \) rule, allowing for a two-level signal output with an alarm level and a drift detection level:

$$\begin{aligned}&\Theta _{alarm} : \Vert {{\mathbf{A}}_{\left( j \right) }}^{(new)} - \overline{{\mathbf{A}}}_{\left( j \right) } \Vert \ge \sigma \end{aligned}$$
(29)
$$\begin{aligned}&\Theta _{drift} : \Vert {{\mathbf{A}}_{\left( j \right) }}^{(new)} - \overline{{\mathbf{A}}}_{\left( j \right) } \Vert \ge 3\sigma \end{aligned}$$
(30)

We use these two-level signals to trigger two important actions in our tensor data stream classification framework.
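A minimal sketch of this two-level detector is given below. The class name, the windowing via a deque, and the aggregation of the deviation into a scalar through the Frobenius norm are our illustrative assumptions rather than a definitive implementation.

```python
import numpy as np
from collections import deque

class TensorDriftDetector:
    """Two-level (alarm / drift) detector based on j-mode flattenings, Eqs. (29)-(30)."""

    def __init__(self, mode=0, window_size=100):
        self.mode = mode
        self.window = deque(maxlen=window_size)

    def add(self, tensor):
        A_j = np.moveaxis(tensor, self.mode, 0).reshape(tensor.shape[self.mode], -1)
        self.window.append(A_j)

    def check(self, tensor):
        """Return 'drift', 'alarm', or 'stable' for the newest tensor."""
        if len(self.window) < 2:
            return 'stable'
        A_new = np.moveaxis(tensor, self.mode, 0).reshape(tensor.shape[self.mode], -1)
        stack = np.stack(self.window)                       # shape (w, N_j, product of rest)
        mean = stack.mean(axis=0)
        sigma = np.linalg.norm(stack.std(axis=0, ddof=1))   # scalar spread of the window
        deviation = np.linalg.norm(A_new - mean)
        if deviation >= 3.0 * sigma:
            return 'drift'
        if deviation >= sigma:
            return 'alarm'
        return 'stable'
```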

4.1 Spanning new feature space after drift

As we use a subsample of tensors to span the kernel similarity-based feature space for decision tree induction, we must take into consideration that after a concept drift the current projection may no longer be representative. Therefore, we need to span a new projection, which imposes an additional computational cost on the system. As it is not feasible to do this every time a new tensor arrives, we propose to combine this with the drift detector. Whenever an alarm signal is raised, we start collecting the incoming tensors in a temporal buffer. Then, when the changes become significant and a drift signal is raised, we randomly subsample this buffer (see Eq. 16) and use it to span a new feature space with the chordal kernel. This approach significantly reduces the number of times we need to recompute the similarity-based feature projections. The entire buffer (all stored tensors) is then used to train a new decision tree in the newly spanned kernel feature space.

4.2 Updating the decision tree classifier after drift

As decision trees do not have built-in mechanisms for handling concept drift, they are combined with a drift detector to guide their update. When concept drift occurs, it is less computationally expensive to discard the old classifier and build a new one on the current concept than to try to adapt the pre-existing tree structure. Since, after the alarm signal has been raised, we already collect incoming tensors for calculating the new feature space, we may use them as well for training a new decision tree. While the new feature space is created using only a subsample of tensors from the buffer, the new decision tree is trained using all of the stored tensors. When a drift signal is raised, we train a new decision tree in the newly created kernel feature space, discard the old model, and replace it with the new decision tree. We then discard all the tensors stored in the temporal buffer, as they are no longer needed.
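A hypothetical control loop tying these two actions to the detector signals is sketched below; it reuses the detector sketch from above, and `span_kernel_space`, `train_tree`, and the tree's `update` method are placeholders standing in for the components described earlier, not an exact reproduction of the CKDT implementation.

```python
import random

def process_stream(stream, detector, span_kernel_space, train_tree, subsample_ratio=0.2):
    """Hypothetical control loop coupling the drift detector with model adaptation."""
    buffer, tree, project = [], None, None
    for tensor, label in stream:
        signal = detector.check(tensor)
        if signal in ('alarm', 'drift'):
            buffer.append((tensor, label))          # collect the emerging concept
        if signal == 'drift' and buffer:
            landmarks = random.sample(buffer, max(1, int(subsample_ratio * len(buffer))))
            project = span_kernel_space([t for t, _ in landmarks])   # new kernel space (Eq. 16)
            tree = train_tree([(project(t), y) for t, y in buffer])  # retrain on all buffered tensors
            buffer.clear()
        elif tree is not None and project is not None:
            tree.update(project(tensor), label)     # standard incremental update
        detector.add(tensor)
```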

5 Experimental study

In this section, we present the experimental evaluation of the proposed framework for tensor data stream classification with decision trees. We conduct two independent experiments: (i) on large-scale real-world tensor benchmarks that display streaming characteristics, in order to examine the usability of the kernel feature space for training decision trees; and (ii) on artificial datasets with specific types of injected concept drift, in order to evaluate the scalability of the proposed framework to a growing tensor factor dimensionality.

The experimental study was designed to answer the following research questions:

  • RQ1: Is the proposed chordal kernel-based decision tree capable of more accurate classification of tensor data streams than the state-of-the-art reference methods?

  • RQ2: Do the proposed online kernel transformation of tensors and the training of decision trees in the kernel feature space avoid imposing prohibitive computational costs on the classification system?

  • RQ3: How does the proposed kernel-based decision tree handle increasing tensor dimensionality (number of factors)?

  • RQ4: Can the proposed method efficiently handle various types of concept drift present in tensor data streams?

5.1 Tensor benchmarks for data stream classification

We have selected four real-world tensor datasets that display streaming characteristics, as well as generated 52 artificial tensor datasets with specific types of injected concept drift. Their details are presented below and in Table 1.

Table 1 Details of used real-world and artificial tensor benchmarks

5.1.1 Chicago Crime (CC)

A collection of crime reports in the city of Chicago, ranging from January 1st, 2001 to December 11th, 2017. We split the original tensor into 5 000 000 separate small tensors, each representing a single crime. The classification task is to predict the crime based on the remaining information in the report.

5.1.2 Yahoo Music (YM)

A collection of user ratings of music items in Yahoo! services. Concept drift is strongly embedded, as the data reflects changes in music distribution platforms and market needs. We subsampled the dimensionality of each individual factor to make the computation feasible on a single machine. The original task was to predict the user rating of an item. We discretized this task into class labels via the average ranking values for items.

5.1.3 Street View House Numbers (SVHN)

A collection of 640 420 images representing house numbers, with each digit displayed individually in the form of a 32x32x3 RGB color image tensor. The classification task is digit recognition.

5.1.4 CIFAR-100 (C100)

A collection of 60 000 images, each stored as a 32x32x3 RGB color image tensor. The task is to predict the group to which the target image belongs.

5.1.5 SimTensor (ST)

An artificial tensor generator (Fanaee-T and Gama 2016) that we use to evaluate the impact of different factor dimensionalities in tensor data streams on decision tree induction. Each artificial benchmark consists of 2 000 000 tensors and each tensor factor has 100 values. We investigate tensor factor dimensionality \(\in [3;15]\). By combining it with MOA (Bifet et al. 2010) functionality, we are able to create four datasets with distinctive types of concept drift (none, incremental, gradual, sudden). SimTensor allows for streaming data generation with defined change points that serve as drift injection moments. Each artificial tensor data stream is a two-class problem, with each tensor class generated from a distinct Gamma distribution. Class labels are assigned to each distribution generator, leading to a supervised learning problem and allowing for the creation of 52 unique tensor data stream benchmarks.

5.2 Experimental set-up

5.2.1 Reference methods

As mentioned, to the best of our knowledge this is the first work proposing the usage of decision trees for the classification of tensor data streams. As we propose a native tensor representation via the chordal kernel (named Chordal Kernel Decision Tree, CKDT), as reference methods we selected three state-of-the-art approaches for tensor vectorization that are able to work in an incremental fashion. We adapted them to this particular learning scenario. Online Robust Low-Rank Tensor Modeling for Streaming Data Analysis (LRTCR) (Li et al. 2019) uses the bilinear formulation of tensor nuclear norms and a stochastic optimization algorithm to learn the tensor low-rank structure alternately for online updating. Online PCA with Optimal Regret (OPOR) (Nie et al. 2016) was proposed for low-dimensional data representation in online scenarios, and thus can be used for tensors. Low-rank tensor decomposition (LRTD) (Guo et al. 2017) was developed for motion detection from videos using deep learning. In order to ensure a fair comparison, we train on these tensor representations the identical McDiarmid's decision tree that we use with our kernel-based feature space. Additionally, as none of these methods was developed for concept drift, we enhance them with our tensor-based drift detector. Furthermore, we present results for a standard McDiarmid's decision tree (MDDT) that uses a vector-based representation. This allows us to evaluate whether operating in tensor space holds advantages over vector space for drifting data streams.

5.2.2 Parameters

For drift detection, we use a window of \(w = 100\) tensors. For the subsampling procedure during kernel feature space spanning, we use \(20\%\) of the tensors stored in the window.

5.2.3 Evaluation metrics

We evaluate the examined tensor classifiers according to their prequential classification accuracy (an accumulative metric used in data streams), prequential multi-class AUC (Wang and Minku 2020), model update time (in seconds), and memory usage (in RAM-hours).

5.3 Experiment 1: real-world tensor streams

In this experiment, we compare our proposed CKDT with three recent approaches for incremental tensor vectorization on four diverse real-world benchmarks that display streaming characteristics. We are interested in evaluating whether the proposed kernel feature space is more information-rich than the vectorized spaces, which would translate into improved classification rates. Additionally, we evaluate the speed and memory consumption of the analyzed approaches, in order to assess their usefulness for data stream scenarios.

Prequential accuracy and prequential multi-class AUC results are presented in Table 2, while Fig. 2 depicts the prequential accuracy as a function of the number of processed tensors. Table 3 presents the update time and memory consumption of the analyzed models, while Table 4 presents the outcome of the Shaffer multiple comparison statistical test of significance with \(\alpha =0.05\).

Table 2 Prequential accuracy (%) and prequential multi-class AUC (%) metrics for analyzed streaming-based tensor classification methods
Fig. 2 Prequential accuracy over progressing real-world tensor data streams

The obtained results show that the vector-based adaptive decision tree (MDDT) cannot handle drifting data streams with tensor representation. On the other hand, the experiments highlight the high efficacy of the proposed CKDT framework. Our approach is able to achieve significantly better classification accuracies than the same decision tree model trained using state-of-the-art incremental tensor vectorization. This shows how information-rich the tensor representation is in the context of data stream classification and that it is highly beneficial to maintain it. Our kernel-based feature space is capable of capturing these valuable properties and translating them into a more effective decision tree induction. In the analyzed real-world datasets (especially in CC and YM) we can observe significant drops in the performance of each analyzed classifier. These moments correspond to severe drifts that render the entire system outdated and need to be handled by a drift detector that replaces the decision tree. While all methods suffer from the presence of drift, we should notice that CKDT achieves faster recovery rates after changes and its performance does not drop as significantly as that of the reference approaches. This can be attributed to the efficient spanning of the similarity-based feature space, which leads to better handling of new concepts and quicker adaptation after the occurrence of concept drift (RQ1 answered).

Table 3 Average update time [s.] and memory consumption [RAM-hours] (calculated over 1000 tensors each) of analyzed decision tree approaches for tensor data stream classification
Table 4 Outcome of the Shaffer post-hoc statistical test for comparison among CKDT and the reference methods over multiple datasets (4 real-world and 52 artificial benchmarks)

When analyzing the computational performance of CKDT, one should notice its shorter update time and lower memory consumption compared to those displayed by the reference methods. This can be attributed to the computational simplifications discussed in the chordal kernel section, as well as to computing the new feature space only when concept drift takes place. Additionally, we may observe that CKDT scales much better to bigger tensor representations (as seen with the Chicago Crime and Yahoo Music datasets), significantly outperforming the other algorithms (RQ2 answered).

5.4 Experiment 2: evaluating the impact of tensor dimensionality

In this experiment, we wanted to evaluate the scalability of our framework to high-dimensional input tensors (i.e., tensors containing a high number of factors). Most existing tensor datasets have between 3 and 5 factors (as they are either images, link relationships, or reviews), but we can expect that more complex tensor domains will soon become increasingly popular. Therefore, we used the SimTensor generator to create a number of tensor data streams with a varying number of factors. This also allowed us to inject concept drift in a controlled manner. Prequential accuracies are presented in Fig. 3.

Fig. 3 Averaged prequential accuracy calculated over the artificial data streams with respect to increasing tensor dimensionality and different types of concept drift

The obtained results confirm our assumption that an increasing number of factors poses a progressively increasing difficulty for all classifiers. We can see that CKDT shows much better scalability to a high number of factors than the other methods, as the chordal distance allows for maintaining the tensor properties regardless of the input. The quality of the feature spaces obtained by vectorization methods suffers in a much more significant manner, making their usage prohibitive in such cases (RQ3 answered).

It is interesting to analyze the interplay between the type of concept drift and the increasing number of factors. One can see that more complex tensors make proper drift detection much more difficult, leading to overall drops in accuracy. The most challenging type of drift is the sudden one, which comes as no surprise, as the system has no time to react to it. The second most difficult drift is much more surprising, as incremental changes are usually easy to handle (Ditzler et al. 2015). In this case, we may attribute this learning difficulty to the way we designed our drift detector. If the change is small enough, the detection signal (\(3\sigma \) rule) will never be triggered, and thus the feature space will never be reconstructed. We will continue our work in this direction, aiming to propose a more advanced tensor-based drift detector that is robust to such situations.

Overall, the proposed CKDT offers superior performance to all reference methods, even when they are enhanced with the proposed tensor-based drift detector. This can be attributed to the combination of the drift detection with the kernel feature space, which is more sensitive to changes in data distributions (RQ4 answered).

6 Conclusions and future works

6.1 Summary

In this paper, we have presented the first framework for tensor data stream classification with decision trees under concept drift. We have identified the drawback of existing data stream classification approaches, namely their limitation to a vector representation of input data. We argued that, as many real-world data sources generate multi-dimensional data that cannot be vectorized without a loss of information, there is a need for tensor-based classifiers for data streams. As a base classifier we selected the McDiarmid's incremental decision tree. In order to alleviate its limitations, we proposed to create a new feature space that operates on tensors and use it for decision tree induction. To this aim we employed kernel feature mapping with a dedicated similarity measure based on the chordal distance. It allowed for calculating the similarity between two tensors directly, without the need for vectorization. We showed how to speed up the creation of the new feature space using random subsampling. We also proposed a concept drift detector based on the tensor data representation that was used to control when to create a new feature space and when to update the classifier. The experimental study carried out on large-scale real-world and artificial tensor data streams showed that our framework preserves the information within tensors, leading to excellent classification accuracy. Additionally, it scales up to high-dimensional tensors and is much less computationally expensive than online vectorization.

6.2 Future works

In our future works, we plan to continue developing a holistic framework for tensor data stream classification that will encompass the following research directions:

  • Ensembles of CKDTs. A natural step forward will be to propose adaptive and online ensembles of Chordal Kernel Decision Trees to boost their predictive accuracy and make them competitive with modern deep learning algorithms (González et al. 2020).

  • Explainable learning from tensor streams. Decision tree structure offers a natural explainable and interpretable format (Sagi and Rokach 2020). This can be leveraged towards understanding the nature of changes in drifting tensor streams.

  • Speeding up CKDTs. The current implementation of CKDT is efficient and faster than state-of-the-art methods, but it can be further improved by using approximate decomposition approaches (Cyganek and Wozniak 2016).

  • Evolving tensor dimensionality. A fully robust framework for tensor data stream mining must offer the capability of adapting to evolving dimensionality and factors of tensors (da Silva Fernandes et al. 2019).