Multidimensional continuous time Bayesian network classifiers

The multidimensional classification of multivariate time series deals with the assignment of multiple classes to time-ordered data described by a set of feature variables. Although this challenging task has received almost no attention in the literature, it arises in a wide variety of domains, such as medicine, finance or industry. The complexity of this problem lies in two nontrivial tasks: learning from multivariate time series in continuous time and the simultaneous classification of multiple class variables that may exhibit dependencies between them. These tasks can be addressed with different strategies, but most of them involve difficult data preprocessing, high space and classification complexity, or ignoring useful interclass dependencies. Additionally, no attention has been given to the development of new multidimensional classifiers of time series based on probabilistic graphical models, even though transparent models can facilitate further understanding of the domain. In this paper, a novel probabilistic graphical model is proposed that is able to classify a discrete multivariate temporal sequence into multiple class variables while modeling their dependencies. This model extends continuous time Bayesian networks, which explicitly represent the behavior of time series evolving over continuous time, to the multidimensional classification problem. Different methods for learning the parameters and structure of the model are presented, and numerical experiments on synthetic and real-world data show encouraging results in terms of performance and learning time with respect to independent classifiers, the current alternative approach under the continuous time Bayesian network paradigm.


| INTRODUCTION
Many classification problems imply the analysis of trends or dynamics that occur in sequences of time-ordered data to perform accurate predictions. Typical examples are found in finance, medicine, signal processing or industry, but more applications are emerging in virtually any domain. [1][2][3][4] This article focuses on the complex scenario of multidimensional classification over multivariate time series, presenting a novel model for the task and a real-world problem where it is applied.
The multidimensional classification problem deals with the simultaneous classification of multiple class variables, that is, it requires the definition of a mapping function that determines the output of several multiclass class variables based on given input data. This learning problem is included in the more general multioutput paradigm, which also covers supervised learning problems with outputs of different data types, such as real-valued or ordinal. The reader is referred to the comprehensive review of Xu et al.5 for an in-depth overview of the multioutput learning paradigm.
Traditional classification algorithms are limited to the prediction of a single variable, so they cannot be directly applied in the studied multidimensional context. Two simple approaches are commonly used to avoid this limitation: the definition of a compound class variable that collects all combinations of class values (label powerset method) and the learning of independent classifiers for each class variable (binary relevance method). Nonetheless, both solutions involve a number of drawbacks, a common one being the inability to model the dependencies between class variables. This independence assumption can be avoided with methods such as chain classifiers,6 which iteratively train a set of classifiers (one per class variable) whose feature spaces are extended with the ground-truth classes of their predecessors. However, this approach is highly dependent on the order in which the classifiers are applied, and finding a suitable order may require exploring an intractable number of chains. The performance of all these solutions could be improved by considering them along with frameworks such as that of Jia and Zhang,7 which enriches the original data by extracting new features that encode information about the class variables.
The above methods transform the multidimensional problem into one or more one-dimensional classifications. Instead, we will study the adaptation of existing algorithms, which tackle the problem more directly without requiring this preprocessing. Learning algorithms such as decision trees,8 support vector machines,9 k-nearest neighbors10 or Bayesian network classifiers11 have already been extended to perform simultaneous classification of multiple class variables. However, some of these proposals focus on the multilabel classification subproblem, where all class variables are binary.12 Apart from the multidimensional facet, the data studied in this article present a temporal dimension that cannot be ignored. To apply most of the above models over temporal data, it is common to follow a preprocessing strategy similar to that for static multidimensional data, called the feature-based approach.13 It transforms the original data set by extracting new feature variables that summarize the dynamics of the time series during a time window. In this way, any traditional static classifier can be applied. Nonetheless, this implies the costly extraction of a potentially large number of variables, which, in the end, may lose significant information. The adaptation of multidimensional classifiers to temporal data has not been extensively studied, and few models can be directly applied in this context. Some of these algorithms include the multilabel k-nearest neighbors with dynamic time warping10,14 and long short-term memory recurrent neural networks.15,16 Of these, the multilabel k-nearest neighbors method stands out due to its simplicity and great performance.17 Nonetheless, since it is a lazy classifier, it may incur a high space and classification complexity.
To the best of our knowledge, the direct application of probabilistic graphical models to the multidimensional classification of time series has received no attention, although they provide appealing characteristics such as an intuitive representation of variable dependencies. Dynamic Bayesian networks18,19 are widely studied in the literature, but they can only model discrete time, a limitation later addressed by continuous time Bayesian networks.20 For one-dimensional classification, continuous time Bayesian network classifiers were introduced by Stella and Amer.21 Here we extend them to the multidimensional problem and present different methods for learning them from data. We believe this is the first probabilistic graphical model able to perform time series multidimensional classification while explicitly modeling continuous time. More specifically, the main contributions of this study are the following:
• A novel multidimensional continuous time Bayesian network classifier that is able to model discrete state multivariate time series data and classify it into multiple class variables. This proposal explicitly represents the temporal dynamics of feature variables in continuous time and seeks to improve its predictions by modeling the dependencies between class variables.
• The introduction of algorithms for learning the parameters and structure of the presented model from data, as well as different structure constraints to adapt the model and its learning to the characteristics and demands of a given problem.
• A comprehensive comparative study with 50 synthetic data sets and a real-world Industry 4.0 data set to validate, under different conditions, the proposed model's effectiveness and its performance improvements with respect to continuous time Bayesian network classifiers using the binary relevance strategy.
• The development of a software tool to allow the application of the presented model in other studies and the sampling of synthetic discrete state multivariate time series data sets with multiple class variables.
The remainder of this article is organized as follows. Section 2 reviews some fundamental concepts on which the presented model is based. Section 3 introduces the multidimensional continuous time Bayesian network classifier, while Sections 4 and 5 explain methods to learn its parameters and structure from data, respectively. Subsequently, Section 6 describes how the model performs multidimensional classification of unseen sequences. Section 7 presents synthetic and real-world experiments and discusses the results that the proposed model obtains. Finally, Section 8 concludes the article and highlights future lines of research. For the sake of clarity, Table 1 provides the list of acronyms used in this study.

| FUNDAMENTALS
A Bayesian network (BN) is a probabilistic graphical model that encodes conditional independence assumptions over some random variables to obtain a factorized version of their joint probability distribution. These models have been widely used in a variety of domains[22-25] and their learning from data with different algorithms constitutes an important active research area. Some reasons for their success are the graphical representation of uncertainty, from which we could even learn causal relationships under certain conditions; that they can handle incomplete data, which is common in real-world problems; and that they can combine expert and data-extracted knowledge.26,27

Definition 1 (Bayesian network). A BN over a set of random variables $\boldsymbol{X} = \{X_1, \ldots, X_n\}$ consists of a directed acyclic graph (DAG) $\mathcal{G}$ encoding conditional independence assumptions among the variables and a set of parameters defining the statistical dependencies between them. This allows the joint probability distribution over $\boldsymbol{X}$ to be factorized as

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i)),$$

where $\mathrm{Pa}(X_i)$ are the parent variables of $X_i$ in $\mathcal{G}$, that is, the variables pointing at $X_i$ in $\mathcal{G}$, and $P(X_i \mid \mathrm{Pa}(X_i))$ is the conditional probability distribution (discrete or continuous) of $X_i$ given $\mathrm{Pa}(X_i)$.
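The factorization above can be verified numerically. The following sketch builds a hypothetical three-variable BN (A → C ← B, with made-up CPT values, not taken from the paper) and checks that the factorized joint distribution is a valid probability distribution:

```python
# Sketch: factorizing a joint distribution with a BN.
# Network A -> C <- B over binary variables; all CPT values are illustrative.
p_a = {0: 0.6, 1: 0.4}
p_b = {0: 0.7, 1: 0.3}
p_c_given_ab = {(0, 0): {0: 0.9, 1: 0.1},
                (0, 1): {0: 0.5, 1: 0.5},
                (1, 0): {0: 0.4, 1: 0.6},
                (1, 1): {0: 0.2, 1: 0.8}}

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) P(b) P(c | a, b)."""
    return p_a[a] * p_b[b] * p_c_given_ab[(a, b)][c]

# The factorized joint must sum to one over all configurations
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```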
The classical definition of a BN is limited to model static data, that is, samples that are assumed to be independent from each other, which is an inconvenience when data is dynamic. This article considers dynamic data with a temporal dimension, which is commonly known as time series data.
Definition 2 (Time series data set). A multivariate time series data set $\mathcal{D} = \{\sigma^1, \ldots, \sigma^N\}$ consists of multiple multivariate temporal sequences or trajectories. Each transition of a sequence is represented by two pairs $(t_j, \boldsymbol{x}^{t_j})$ and $(t_{j+1}, \boldsymbol{x}^{t_{j+1}})$, giving the times and joint states of the feature variables immediately before and after the transition.

Dynamic Bayesian networks are the best-known extension of BNs to model temporal data, and they have been successfully used in real-world problems.[28-30] Nonetheless, they are based on discretizing time, forcing us to define a uniform time granularity even when the described processes evolve at different rates. Continuous time Bayesian networks (CTBNs)20 overcome this limitation by explicitly modeling the evolution of discrete variables over continuous time. Unlike a BN, the graph $\mathcal{G}$ of a CTBN can be cyclic, since its arcs represent the dependencies between variables across time; that is, the state to which a variable will transition depends on the current state of other variables.
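A defining feature of CTBNs is that the time a variable remains in a state is exponentially distributed, governed by an intensity parameter. A minimal sketch (with an illustrative intensity value) checks that sampled sojourn times have the expected mean $1/q$:

```python
# Sketch: exponentially distributed sojourn times, as used by CTBNs.
# The intensity value q is illustrative.
import random

random.seed(0)
q = 2.0  # hypothetical intensity of leaving the current state
sojourns = [random.expovariate(q) for _ in range(100_000)]
mean_sojourn = sum(sojourns) / len(sojourns)  # should approach 1 / q = 0.5
```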
As with BNs, the interest in CTBNs is not solely motivated by knowledge discovery; they can also be applied to the classification of unseen sequences. The family of classification models known as continuous time Bayesian network classifiers (CTBNCs)21 is able to perform classification over the data introduced in Definition 4 by incorporating a new, time-independent, class variable node into the CTBN.

Definition 5 (Continuous time Bayesian network classifier). A CTBNC is a pair composed of a CTBN over the feature variables and a time-independent class variable $C$, with sample space $\Omega_C$, that is fully specified by the marginal probability $P(C)$. The CTBNC graph has the same properties as that of a CTBN, but includes a class variable node with no parents, that is, $\mathrm{Pa}(C) = \emptyset$.

| MULTIDIMENSIONAL CONTINUOUS TIME BAYESIAN NETWORK CLASSIFIER
The existence of multidimensional classification data provoked the appearance of multidimensional Bayesian network classifiers (MBCs),11 but only in static settings. However, there are real-world problems where we need to classify temporal sequences into multiple class variables. Take as an example the problem presented in Section 7.1.2, which requires identifying the power consumption state (discretized as high, low or inactive) of the elements of an industrial machine based on the energy data it produces. Surely, several individual models (one per element) could be used to predict each of them. However, this would not identify inter-element dependencies, which could provide valuable, or even crucial, information to classify certain class variables. For example, it is known that some elements always work together, while others cannot be active at the same time. The multidimensional continuous time Bayesian network classifier (Multi-CTBNC) that we introduce here seeks to extend CTBNCs to this more complex scenario, allowing the interclass interactions to be modeled by capturing the probabilistic relationships of the class variables with a BN. A Multi-CTBNC is specified by:
• A graph $\mathcal{G}$ with one node per feature and class variable.
• A set of CIMs, one for each feature variable $X_i$.
• An initial distribution $P^0$ to represent the initial state of a sequence. As more than one class variable is present, $P^0$ is specified as an MBC.
As the class variables do not depend on time, a Multi-CTBNC captures their probabilistic relationships with a BN. Thus, this model can be decomposed into a BN and a CTBN, which divide $\mathcal{G}$ into three subgraphs:
1. The class subgraph $\mathcal{G}_{\mathcal{C}} = (\mathcal{C}, E_{\mathcal{C}})$: defined by a BN that models the dependencies between class variables. This subgraph must be a DAG.
2. The feature subgraph $\mathcal{G}_{\mathcal{X}} = (\mathcal{X}, E_{\mathcal{X}})$: defined by a CTBN that models the dependencies between feature variables over time and, therefore, cycles may appear.
3. The bridge subgraph $\mathcal{G}_{\mathcal{C}\mathcal{X}} = (\mathcal{C} \cup \mathcal{X}, E_{\mathcal{C}\mathcal{X}})$: represents the dependencies of features on class variables and, therefore, is defined by the same CTBN as the feature subgraph. It is also known as the feature selection subgraph11 since it specifies which feature variables are relevant for classification.
Figure 1 shows a Multi-CTBNC graph and its constituent subgraphs. Note that the structure of this model shares many similarities with that of an MBC, despite the multiple differences in their underlying paradigms, although the Multi-CTBNC allows the appearance of cycles in its feature subgraph.
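These structural restrictions are easy to check programmatically. The sketch below (with hypothetical variable names and arcs) validates that a candidate graph satisfies them: the class subgraph is a DAG, the bridge subgraph only contains class-to-feature arcs, and the feature subgraph may contain cycles:

```python
# Sketch: checking the structural constraints of a Multi-CTBNC graph.
# Variable names and arcs are illustrative. Arcs are (parent, child) pairs.
classes = {"C1", "C2"}
features = {"X1", "X2", "X3"}
arcs = {("C1", "C2"),                 # class subgraph (must be a DAG)
        ("C1", "X1"), ("C2", "X2"),   # bridge subgraph (class -> feature only)
        ("X1", "X2"), ("X2", "X1"),   # feature subgraph (cycles allowed)
        ("X2", "X3")}

def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no cycles."""
    indeg = {n: 0 for n in nodes}
    for _, v in edges:
        indeg[v] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for x, v in edges:
            if x == u:
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
    return seen == len(nodes)

class_arcs = {(u, v) for u, v in arcs if u in classes and v in classes}
bridge_ok = not any(u in features and v in classes for u, v in arcs)
valid = is_acyclic(classes, class_arcs) and bridge_ok
```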

| PARAMETER LEARNING
The parameters of a Multi-CTBNC are those of a BN and a CTBN. As we are assuming all variables to be discrete, Multi-CTBNC nodes contain either CPTs (for class variables) with parameters:
• $\beta_{c_j \mid \mathrm{pa}(C_y)}$: probability of class variable $C_y$ taking state $c_j$ given the parents' state $\mathrm{pa}(C_y)$. These are the parameters of the multinomial distribution over the class variables' states;
or CIMs (for feature variables), which are summarized by two types of parameters:
• $q_{x_j \mid \mathrm{pa}(X_f)}$: intensity with which feature variable $X_f$ leaves state $x_j$ when the parents' state is $\mathrm{pa}(X_f)$. This is the parameter of the exponential distribution over the time a feature variable remains in a certain state;
• $\theta_{x_j x_k \mid \mathrm{pa}(X_f)}$: probability that $X_f$ transitions from state $x_j$ to state $x_k$ when a transition is known to occur and the parents' state is $\mathrm{pa}(X_f)$. These are the parameters of the multinomial distribution over the state a feature variable will transition to.
To estimate these parameters, some sufficient statistics that summarize all the observable data are recorded:
• $N_{c_j \mid \mathrm{pa}(C_y)}$: number of sequences where class variable $C_y$ takes state $c_j$ and its parents' state is $\mathrm{pa}(C_y)$;
• $M_{x_j x_k \mid \mathrm{pa}(X_i)}$: number of transitions of $X_i$ from state $x_j$ to state $x_k$ when the parents' state is $\mathrm{pa}(X_i)$, and $M_{x_j \mid \mathrm{pa}(X_i)} = \sum_{x_k \neq x_j} M_{x_j x_k \mid \mathrm{pa}(X_i)}$: number of transitions of $X_i$ from state $x_j$ to any other state;
• $T_{x_j \mid \mathrm{pa}(X_i)}$: time that $X_i$ spends in state $x_j$ when the parents' state is $\mathrm{pa}(X_i)$.
Given the structure of a Multi-CTBNC, its parameters can be estimated with methods like maximum likelihood estimation or Bayesian estimation. The first approach assumes that the parameters are constants, seeking the values that maximize the probability of the observable data, that is, the maximum likelihood estimates. The parameters are then estimated as follows:31,32

$$\hat{\beta}_{c_j \mid \mathrm{pa}(C_y)} = \frac{N_{c_j \mid \mathrm{pa}(C_y)}}{\sum_{c_k} N_{c_k \mid \mathrm{pa}(C_y)}}, \quad \hat{q}_{x_j \mid \mathrm{pa}(X_i)} = \frac{M_{x_j \mid \mathrm{pa}(X_i)}}{T_{x_j \mid \mathrm{pa}(X_i)}}, \quad \hat{\theta}_{x_j x_k \mid \mathrm{pa}(X_i)} = \frac{M_{x_j x_k \mid \mathrm{pa}(X_i)}}{M_{x_j \mid \mathrm{pa}(X_i)}}.$$
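A minimal sketch of how the sufficient statistics and maximum likelihood estimates could be computed for one feature variable, using an illustrative trajectory for a single parent configuration:

```python
# Sketch: sufficient statistics and MLE for one feature variable of a CTBN.
# The trajectory is illustrative: (time, state) records; the last entry
# marks the end of the observation window.
from collections import defaultdict

trajectory = [(0.0, "a"), (1.2, "b"), (1.7, "a"), (3.0, "b"), (3.4, "a"), (5.0, None)]

T = defaultdict(float)    # T[x]: total time spent in state x
M = defaultdict(float)    # M[(x, x')]: number of transitions from x to x'
for (t0, x0), (t1, x1) in zip(trajectory, trajectory[1:]):
    T[x0] += t1 - t0
    if x1 is not None:
        M[(x0, x1)] += 1

M_x = defaultdict(float)  # M[x]: transitions leaving state x
for (x, _), n in M.items():
    M_x[x] += n

q_hat = {x: M_x[x] / T[x] for x in T}                  # intensity estimates
theta_hat = {k: n / M_x[k[0]] for k, n in M.items()}   # transition probabilities
```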
In the case of Bayesian estimation, the parameters are considered random variables and a prior distribution is defined over them. This is a more interesting approach since we can add prior expert knowledge and avoid the zero-count problem.33 Conjugate prior distributions are defined for the two types of distributions used by the Multi-CTBNC, the multinomial and exponential distributions. The most common prior distribution for the multinomial parameters is the Dirichlet distribution,34 while for the exponential parameter, the Gamma distribution is an appropriate choice.32 As conjugate priors are used, the posterior distribution of the parameters given the observed data follows the same family and, therefore, can be obtained analytically. The parameters can then be estimated using, for example, their expected values, in the same way as in maximum likelihood estimation, but including the hyperparameters of the Dirichlet and Gamma prior distributions, namely $\lambda_{c_j \mid \mathrm{pa}(C_y)}$ for the class CPTs and $\alpha_{x_j \mid \mathrm{pa}(X_i)}$ and $\tau_{x_j \mid \mathrm{pa}(X_i)}$ for the CIMs. These hyperparameters can be seen as imaginary counts of the sufficient statistics that occur before any data is observed; that is, $\alpha_{x_j \mid \mathrm{pa}(X_i)}$ is the number of transitions from state $x_j$ to any other state and $\tau_{x_j \mid \mathrm{pa}(X_i)}$ is the time $X_i$ remains in state $x_j$, all before a data set $\mathcal{D}$ is considered. The hyperparameters can be defined by using expert knowledge and/or optimization techniques,35 such as random search36 or Bayesian optimization.37
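A sketch of the corresponding Bayesian estimates with imaginary counts follows; the hyperparameter values and statistics are illustrative, and the exact form of the estimators depends on the chosen parameterization of the Gamma and Dirichlet priors:

```python
# Sketch: posterior-mean style estimates with imaginary counts.
# All hyperparameters and sufficient statistics below are illustrative.
alpha, tau = 1.0, 0.5   # Gamma imaginary transitions and imaginary time
lam = 1.0               # Dirichlet hyperparameter for the class CPT

M_xj, T_xj = 4.0, 2.0                  # observed transitions from x_j, time in x_j
q_hat = (alpha + M_xj) / (tau + T_xj)  # intensity estimate

M_jk = {"x_k": 3.0, "x_m": 1.0}        # transitions x_j -> each other state
theta_hat = {k: (alpha + m) / (alpha * len(M_jk) + M_xj)
             for k, m in M_jk.items()}

N = {"c1": 5.0, "c2": 2.0}             # class counts given pa(C_y)
beta_hat = {c: (lam + n) / (lam * len(N) + sum(N.values()))
            for c, n in N.items()}
```

Note that the imaginary counts in the denominators guarantee strictly positive estimates even for unobserved states, which is how the zero-count problem is avoided.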

| STRUCTURE LEARNING
The problem of learning the structure of a CTBN has been traditionally approached as an optimization problem, 32,38,39 where a score assigned to structures is maximized. Therefore, this section adapts some common scores to the multidimensional classification problem with a Multi-CTBNC.
Score-based algorithms define a score to evaluate how well a model fits the observed data and a space of candidate structures the model can take. The components of these algorithms are:
• An optimization algorithm: some examples are greedy hill climbing, tabu search, simulated annealing or genetic algorithms.32,34,40,41
• A hypothesis space of structures: in the case of a Multi-CTBNC, the class subgraph must be acyclic and the bridge subgraph can only contain arcs from class variables to features. The feature subgraph has no restrictions.
• A score metric: these are commonly divided into those based on the likelihood function and the Bayesian scores.
The next sections introduce three different scores for learning the structure of Multi-CTBNCs.
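The interplay of these components can be sketched with a toy greedy hill climbing search over the parent set of a single node; the `score` function below is a hypothetical stand-in for any decomposable score, rewarding an assumed "true" parent set and penalizing extra parents:

```python
# Sketch: greedy hill climbing over the parent set of one node.
# The score function is a toy placeholder, not one of the paper's scores.
def score(parents):
    target = {"C1", "X2"}  # hypothetical "true" parents of the node
    return 2 * len(parents & target) - len(parents - target)

candidates = {"C1", "C2", "X2", "X3"}
parents = set()            # empty initial structure
improved = True
while improved:
    improved = False
    for cand in sorted(candidates - parents):
        if score(parents | {cand}) > score(parents):
            parents.add(cand)     # greedily keep any improving arc
            improved = True
```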

| Log-likelihood
The simplest approach to learn the structure of a Multi-CTBNC is to maximize the likelihood of the observed data given the model, $P(\mathcal{D} \mid \mathcal{M})$. The likelihood function of a Multi-CTBNC is obtained by incorporating the appearance and the dependencies of the class variables into the likelihood function of a CTBN:32

$$P(\mathcal{D} \mid \mathcal{M}) = \underbrace{\prod_{l=1}^{N} \prod_{y=1}^{d} P(c_y^l \mid \mathrm{pa}(C_y))}_{\text{(1a)}} \; \underbrace{\prod_{l=1}^{N} P(\sigma^l \mid \boldsymbol{c}^l)}_{\text{(1b)}},$$

where $\boldsymbol{c}^l = (c_1^l, \ldots, c_d^l)$ is the class configuration of sequence $\sigma^l$. The equation above shows that the likelihood of a Multi-CTBNC decomposes into that of a BN (1a) and that of a CTBN (1b). This means that the learning of the class subgraph structure (BN) and of the feature and bridge subgraphs (CTBN) of a Multi-CTBNC can be performed separately, since they do not influence each other. In the case of the BN, the search space is limited to directed acyclic graphs, while the search space of the CTBN is simpler, since its graph can be cyclic. As the bridge subgraph encodes the dependencies of the features on the class variables, it is also defined during the learning of the CTBN. As explained above, its structure has to be restricted to only allow arcs from class variables to features. This is the same restriction found in a CTBNC, but here extended to more than one class variable.
In practice, instead of using the likelihood function, a better approach is to maximize the log-likelihood (LL), which is monotonically related to it.31 The reason is that the LL is much easier to maximize and helps to prevent the underflows and overflows caused by the multiplication of small numbers and by the exponential function:

$$\mathrm{LL}(\mathcal{M} : \mathcal{D}) = \underbrace{\sum_{l=1}^{N} \sum_{y=1}^{d} \log P(c_y^l \mid \mathrm{pa}(C_y))}_{\text{(2a)}} + \underbrace{\sum_{l=1}^{N} \log P(\sigma^l \mid \boldsymbol{c}^l)}_{\text{(2b)}}. \qquad (2)$$

This score tends to overfit the data by favoring densely connected networks. Therefore, a penalization factor over the complexity of the network, that is, the number of parameters, can be included. Widely known penalization functions such as the Bayesian information criterion (BIC)42 or the Akaike information criterion43 can be applied over both the LL of the BN and that of the CTBN. For example, the LL of a Multi-CTBNC with BIC penalization is

$$\mathrm{LL}_{\mathrm{BIC}}(\mathcal{M} : \mathcal{D}) = \mathrm{LL}(\mathcal{M} : \mathcal{D}) - \frac{\log N}{2} \left( \dim(\mathcal{B}) + \dim(\mathcal{N}) \right),$$

where $\dim(\mathcal{B}) = \sum_{y=1}^{d} r_{C_y} (|\Omega_{C_y}| - 1)$ and $\dim(\mathcal{N}) = \sum_{f=1}^{n} r_{X_f} |\Omega_{X_f}| \, s_{X_f}$ are the dimensions (number of independent parameters) of the BN and the CTBN, respectively, $r_{C_y}$ ($r_{X_f}$) is the number of possible instantiations of $\mathrm{Pa}(C_y)$ ($\mathrm{Pa}(X_f)$) and $s_{X_f}$ is the number of possible transitions from each state of $X_f$ to any other.
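The BIC penalization can be sketched as follows, using the dimension counts described above with illustrative cardinalities and an assumed unpenalized log-likelihood value:

```python
# Sketch: BIC-penalized log-likelihood for a Multi-CTBNC.
# Cardinalities, parent-instantiation counts and ll are illustrative.
import math

N = 1000       # number of sequences
ll = -5400.0   # hypothetical unpenalized log-likelihood

# class variables: (parent instantiations r, cardinality)
class_vars = [(1, 3), (3, 3)]
dim_bn = sum(r * (card - 1) for r, card in class_vars)

# feature variables: (parent instantiations r, cardinality); per parent
# instantiation, each state contributes (card - 1) free parameters
# (its intensity plus the free transition probabilities)
feature_vars = [(9, 4), (3, 4)]
dim_ctbn = sum(r * card * (card - 1) for r, card in feature_vars)

bic = ll - math.log(N) / 2 * (dim_bn + dim_ctbn)
```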

| Conditional log-likelihood
When facing a multidimensional classification problem, the LL can be rewritten as

$$\mathrm{LL}(\mathcal{M} : \mathcal{D}) = \sum_{l=1}^{N} \log P(\boldsymbol{c}^l \mid \sigma^l) + \sum_{l=1}^{N} \log P(\sigma^l). \qquad (3)$$

Equation (3) divides the LL into a first term representing the model's ability to classify a sequence and a second term describing the dependencies among the feature variables. If the number of features is large, the LL is dominated by the second term, which may negatively impact the performance of the classifier. Consequently, Friedman et al.44 proposed to focus only on the first term, which is the conditional log-likelihood (CLL), thereby following a discriminative approach. The CLL function was previously proposed by Codecasa and Stella38 for the learning of CTBNCs, and here we extend it to Multi-CTBNCs.
The idea is to specialize the LL of the CTBN (term (2b)) in the classification task, since the CTBN defines the bridge subgraph and, therefore, the feature variables that are relevant for classification. The LL of the BN (term (2a)) remains unchanged, as interclass dependencies may be relevant for the classification. This results in the following score:

$$\mathrm{CLL}(\mathcal{M} : \mathcal{D}) = \sum_{l=1}^{N} \sum_{y=1}^{d} \log P(c_y^l \mid \mathrm{pa}(C_y)) + \sum_{l=1}^{N} \log \frac{P(\boldsymbol{c}^l)\, P(\sigma^l \mid \boldsymbol{c}^l)}{\sum_{\boldsymbol{c}' \in \Omega_{\boldsymbol{C}}} P(\boldsymbol{c}')\, P(\sigma^l \mid \boldsymbol{c}')}. \qquad (4)$$

Equation (4) includes a normalization factor (the denominator term), which is what differentiates its result from that given by the LL (Equation 3). The most notable differences with respect to the CLL for a CTBNC are found in the class probability and denominator terms, which take into account the multiple class variables and their dependencies. The class probability is estimated as $P(\boldsymbol{c}) = \prod_{y=1}^{d} P(c_y \mid \mathrm{pa}(C_y))$, while the denominator sums this quantity, weighted by the corresponding likelihood, over all class configurations in $\Omega_{\boldsymbol{C}}$. In the case of the likelihood term $P(\sigma^l \mid \boldsymbol{c}^l)$, the only difference is that parameters and sufficient statistics are defined based on the state of, potentially, multiple class variables that are parents of the features. Unfortunately, the denominator term cannot be further decomposed, as the logarithm is applied to a sum over all class configurations. Therefore, the log-sum-exp trick may be required in practice to prevent underflows and overflows.33
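The log-sum-exp trick mentioned above can be sketched with illustrative unnormalized log-posteriors, where naive evaluation of the exponentials would underflow:

```python
# Sketch: log-sum-exp trick for evaluating log sum_c exp(logp_c)
# without underflow. The log-probability values are illustrative.
import math

log_probs = [-1050.0, -1052.3, -1049.1]   # unnormalized log-posteriors
m = max(log_probs)
log_denominator = m + math.log(sum(math.exp(lp - m) for lp in log_probs))
# naive evaluation would underflow: math.exp(-1050.0) returns 0.0
```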

| Bayesian Dirichlet equivalent score
This section presents a Bayesian score function for learning Multi-CTBNCs, the Bayesian Dirichlet equivalent (BDe) score, which was first presented for CTBNs by Nodelman et al.32 Bayesian scores are derived from the logarithm of Bayes' rule to obtain the model structure with the highest probability given the data:

$$\mathrm{score}_{B}(\mathcal{G} : \mathcal{D}) = \log P(\mathcal{D} \mid \mathcal{G}) + \log P(\mathcal{G}).$$

They incorporate a prior probability over the model structures, $P(\mathcal{G})$, which is maximized together with the marginal likelihood of the data given the structure, $P(\mathcal{D} \mid \mathcal{G})$, which, in contrast to the likelihood, does not consider a specific assignment of the parameters. Rather, it adds uncertainty about them by integrating over all their possible values $\boldsymbol{\theta}$:

$$P(\mathcal{D} \mid \mathcal{G}) = \int P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \mathcal{G})\, d\boldsymbol{\theta}.$$

This has the advantage over the previous scores of hopefully reducing overfitting, since the evaluation is performed over all possible values of the parameters.
Making the common assumptions of global and local parameter independence,34 the log-marginal-likelihood of the data given the structure, $\log P(\mathcal{D} \mid \mathcal{G})$, can be decomposed into a sum of local log-marginal-likelihoods (LMLs), one for each type of parameter and variable. Thanks to the conjugacy of the Dirichlet and Gamma priors, each LML can be obtained in closed form from the sufficient statistics and hyperparameters. Finally, if a uniform prior over the structures $P(\mathcal{G})$ is considered, the BDe score simply maximizes the log-marginal-likelihood of the observed data given a Multi-CTBNC. To further penalize complex structures, a nonuniform structure prior $P(\mathcal{G})$, such as a Binomial prior distribution,46 can be considered.

| Structure constraints
Due to the complexity of finding and learning all possible Multi-CTBNCs, some assumptions can be made about the structure; that is, the hypothesis space can be reduced. Furthermore, these assumptions could help to learn models with better performance, since they can prevent overfitting the data. As with MBCs, a variety of Multi-CTBNC families can be proposed by considering different search spaces for the class and feature subgraphs. For example, they can be limited to being empty, a tree, a forest, a polytree, a maxK (nodes have at most K parents), a DAG or, in the case of the feature subgraph, a directed graph (digraph). Some of these families are illustrated in Figure 2.
A well-known subclass of classifiers is the naive Bayes family, which assumes conditional independence between the features given the class variables. In the case of the Multi-CTBNC, a fully naive model is an empty-empty Multi-CTBNC (see Figure 2A) with a complete bridge subgraph. The benefit of this model over independent continuous time naive Bayes classifiers (CTNBCs)21 is that the number of parameters can be drastically reduced for certain data sets. Take as an example the parameters that would be learned from a data set with eight ternary variables, three of them class variables. If three CTNBCs are built, the total number of independent parameters would be 276, since there would be 45 intensity matrices (three for each of the 15 feature nodes), each with six degrees of freedom, plus three CPTs with two degrees of freedom each. However, if there are only five possible class configurations, a fully naive Multi-CTBNC would require just 156 parameters, since the number of intensity matrices is reduced to 25 (five for each of the five feature nodes). Therefore, if the number of possible states of the class variables is large, but the number of class configurations is relatively small, a fully naive Multi-CTBNC would need a potentially much smaller number of parameters.
The *-maxK Multi-CTBNC family is of special interest since, for fixed K, the learning of a CTBN can be performed in time polynomial in the number of variables and the data set size.32 This is possible because the parent set of each CTBN feature can be optimized individually, without having to worry about avoiding cycles in the resulting structure. Unfortunately, the complexity of learning the bridge and feature subgraphs increases rapidly with the inclusion of more class variables. A *-maxK Multi-CTBNC does not limit the number of class variables that can be parents of the features, in the same way as a *-maxK MBC does, so the total number of parents could potentially be K + d for each feature, where d is the number of class variables. Evidently, this problem could be alleviated by imposing restrictions on the number of class parents.48 Regarding the learning of the class subgraph, if a general DAG or even a max2 structure is considered, finding its optimal structure would be NP-hard due to the acyclicity constraint.49 If the number of class variables is relatively large, a tree structure may be a better option, since polynomial-time learning algorithms can be applied.34

| MULTIDIMENSIONAL CLASSIFICATION

Given an unseen sequence $\sigma^p$ whose transitions are fully observed, classification is performed by choosing the class configuration that maximizes the posterior probability, that is, the maximum a posteriori estimate of

$$P(\boldsymbol{c} \mid \sigma^p) = \frac{P(\boldsymbol{c})\, P(\sigma^p \mid \boldsymbol{c})}{P(\sigma^p)}. \qquad (6)$$

The numerator factorizes into three types of terms. First, $\exp\!\left(-q_{x^{t_j}_{pf} \mid \mathrm{pa}(X_f)} (t_{j+1} - t_j)\right)$ is the probability that feature variable $X_f$ stays in state $x^{t_j}_{pf}$ during the time $t_{j+1} - t_j$ given the class configuration $\boldsymbol{c}$. Second, $\theta_{x^{t_j}_{pf} x^{t_{j+1}}_{pf} \mid \mathrm{pa}(X_f)}$ is the probability that $X_f$ transitions from state $x^{t_j}_{pf}$ to $x^{t_{j+1}}_{pf}$ when it is known that a transition occurs and given the class configuration $\boldsymbol{c}$, where $\epsilon$ is a small positive number. Finally, $P(c_y \mid \mathrm{pa}(C_y)) = \beta_{c_y \mid \mathrm{pa}(C_y)}$ represents the probability of class variable $C_y$ taking state $c_y$ given the parents' state $\mathrm{pa}(C_y)$. Therefore, the predicted class configuration $\boldsymbol{c}^*$ for a sequence $\sigma^p$ is

$$\boldsymbol{c}^* = \arg\max_{\boldsymbol{c} \in \Omega_{\boldsymbol{C}}} P(\boldsymbol{c})\, P(\sigma^p \mid \boldsymbol{c}). \qquad (7)$$

As before, the logarithm of the argmax argument in Equation (7) is used instead for convenience.
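The MAP rule of Equation (7) can be sketched as an exhaustive search over class configurations in log-space; the prior and likelihood values below are illustrative placeholders, not outputs of a trained model:

```python
# Sketch: MAP classification over all class configurations in log-space.
# The prior and likelihood values are illustrative placeholders.
import math
from itertools import product

states = ["high", "low"]
class_vars = ["M1", "M2"]

def log_prior(c):
    # hypothetical BN over the classes: M1 and M2 tend to agree
    return math.log(0.4 if c["M1"] == c["M2"] else 0.1)

def log_likelihood(c):
    # hypothetical log P(sigma^p | c) values for one observed sequence
    table = {("high", "high"): -10.2, ("high", "low"): -11.5,
             ("low", "high"): -12.0, ("low", "low"): -10.9}
    return table[(c["M1"], c["M2"])]

scores = {}
for combo in product(states, repeat=len(class_vars)):
    c = dict(zip(class_vars, combo))
    scores[combo] = log_prior(c) + log_likelihood(c)

c_star = max(scores, key=scores.get)   # predicted class configuration
```

Note how the interclass prior can overturn the likelihood: a configuration with a slightly worse likelihood may still win if the class dependencies favor it.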
If estimating the a posteriori probabilities of each class configuration is needed, the marginal likelihood of the sequence $\sigma^p$ must be computed; that is, the denominator term of the posterior probability (Equation 6) cannot be ignored:

$$P(\sigma^p) = \sum_{\boldsymbol{c} \in \Omega_{\boldsymbol{C}}} P(\boldsymbol{c})\, P(\sigma^p \mid \boldsymbol{c}).$$

Unfortunately, as in the case of the CLL denominator, this term cannot be further decomposed.

| EXPERIMENTS
This section will empirically compare the performance of the Multi-CTBNC with multiple independent CTBNCs for the multidimensional classification of synthetic and real-world time series data. The assessment will be performed with several performance evaluation metrics estimated with a fivefold cross-validation scheme to guarantee an honest and fair comparison. The learning of the model structures will be done by hill-climbing optimization, using the aforementioned scores and an empty initial structure. Parameters will be learned with Bayesian estimation, using Dirichlet and Gamma hyperparameter values that were found to provide good results. All experiments were run on an Intel Core i7-7700K at 4.20 GHz with 32 GB of RAM under the Windows 10 operating system. The Multi-CTBNC and CTBNC were developed in Java, and the software and data sets are freely available at https://github.com/carlvilla/Multi-CTBNCs.

| Data sets
The proposed model will be first evaluated over randomly generated synthetic data. Then, a real-world data set from an industrial machine (from now on referred to as energy data set) is used to prove the usefulness of the model in a real-world scenario.

| Synthetic data sets
Fifty synthetic time series data sets were randomly generated from the structures of Figures 1A and 3A-I (five data sets per structure). The objective is to compare the performance of the Multi-CTBNC with independent CTBNCs when the data is obtained under diverse conditions. The class variables ($C_1$, $C_2$, $C_3$, $C_4$, and $C_5$) are assumed to be ternary, while the feature variables ($X_1$, $X_2$, $X_3$, $X_4$, and $X_5$) can take four values, so they have three possible transitions from a certain state.
(Figure 3: structures from which the synthetic data sets were generated.)

The data sets are sampled via probabilistic logic sampling51 from random CPTs and CIMs. To sample a sequence, a class configuration (obtained from the BN over the class variables) and an initial observation are first defined. From these, we can sample the time that each feature variable will remain in its current state. This time determines the order in which the transitions of the features will occur, so a new observation of the sequence is obtained after the transition of only one of them. The new state to which the feature variable transitions is sampled by taking into account the current state of its parent features, as well as the classes of its parent class variables. Once the transition is done, a duration time for the new state of the feature is sampled, and the above process is repeated until a sequence of a predefined duration is obtained. In our case, each data set contains 10,000 sequences that last slightly more than 10 time units (an average of 164 transitions).
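The sampling procedure described above can be sketched with competing exponential clocks, one per feature; the intensities are illustrative and, for simplicity, the states are binary and flipped deterministically instead of being sampled from the transition probabilities θ:

```python
# Sketch: sampling a trajectory with competing exponential clocks.
# Intensities are illustrative; in the full scheme, the new state would be
# sampled from theta given the parents, and q would depend on the classes.
import random

random.seed(1)
q = {"X1": 1.0, "X2": 3.0}          # intensity of leaving the current state
state = {"X1": 0, "X2": 0}
clock = {f: random.expovariate(q[f]) for f in q}  # absolute transition times

trajectory = [(0.0, dict(state))]
horizon = 10.0                      # predefined sequence duration
while True:
    nxt = min(clock, key=clock.get)  # feature transitioning next
    if clock[nxt] > horizon:
        break
    state[nxt] = 1 - state[nxt]      # flip the binary state (simplification)
    trajectory.append((clock[nxt], dict(state)))
    clock[nxt] += random.expovariate(q[nxt])  # resample its sojourn time

times = [t for t, _ in trajectory]
```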

| Energy data set
The energy data set contains electrical measurements extracted, in collaboration with a partner company, from an industrial machine working in a real environment. These variables include intensity (I), voltage (V), active power (P), reactive power (Q), and apparent power (S), which were observed at a sampling frequency of 500 Hz and discretized into 30 states with an equal-width discretization. The industrial machine is composed of different three-phase motors and, therefore, for each energy variable a measurement in each of the three phases (A, B, and C) was obtained. In total, the data set has 15 feature variables.
The task to perform with the energy data set is to classify the power consumption state (high, low, or inactive) of six motors (M1, M2, M3, M4, M5, and M6), which constitute six class variables, by using the energy consumption of the machine as a whole. It is important to note that these motors are related to each other, as M1 and M2, as well as M5 and M6, work together on very similar tasks. At the same time, M3 and M4 work synchronously with the motor pairs M1/M2 and M5/M6, respectively.
Since in a real application we cannot obtain sequences where the motors keep a unique state throughout all transitions, the extracted training sequences have a fixed duration of 0.3 s, determined based on the needs of the company, and each motor is assigned the consumption state that occurs the most within the sequence.
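One reasonable reading of "the state that occurs the most" is the state occupying the largest share of the window's duration; a small sketch under that assumption (the function name is hypothetical):

```python
from collections import Counter

def window_label(states_with_durations):
    """Given (state, duration) segments within a fixed window, return the
    state that accounts for the most total time in that window."""
    total = Counter()
    for state, dur in states_with_durations:
        total[state] += dur
    return total.most_common(1)[0][0]
```

For example, a 0.3 s window split into segments `[("high", 0.10), ("low", 0.15), ("inactive", 0.05)]` would be labeled "low".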

| Performance evaluation metrics
Evaluation metrics for multidimensional classifiers should consider, simultaneously, the performance of the model on multiple class variables. Although the literature on this topic is limited, several evaluation metrics have already been proposed for this context:

• Global accuracy: 47 ratio of sequences that were correctly classified for all class variables, that is, a partially correct classification is considered an error:

$\mathrm{Acc} = \frac{1}{N} \sum_{l=1}^{N} \delta(\mathbf{c}'_l, \mathbf{c}_l),$

where $\mathbf{c}'_l$ and $\mathbf{c}_l$ are the predicted and actual class vectors of sequence $l$, respectively, and $\delta(\cdot, \cdot)$ is Kronecker's delta function, so $\delta(\mathbf{c}'_l, \mathbf{c}_l) = 1$ if $\mathbf{c}'_l = \mathbf{c}_l$ and 0 otherwise.

• Mean accuracy: 47 mean of the accuracies obtained for each class variable separately:

$\overline{\mathrm{Acc}} = \frac{1}{d} \sum_{y=1}^{d} \mathrm{Acc}_y,$

where $\mathrm{Acc}_y$ is the accuracy for class variable $C_y$ and $d$ is the number of class variables.

• Global Brier score: 52 measures the accuracy of probabilistic classifiers by considering the probabilities they assign to multidimensional predictions:

$\mathrm{Bs} = \frac{1}{N} \sum_{l=1}^{N} \sum_{\mathbf{c} \in \Omega_{\mathbf{C}}} \left( P(\mathbf{c} \mid \mathbf{x}_l) - \delta(\mathbf{c}, \mathbf{c}_l) \right)^2,$

where $\Omega_{\mathbf{C}}$ is the space of joint configurations of the class variables.

• $F_1$ score: harmonic mean of the precision ($P$) and recall ($R$) on a class $c_j$:

$F_1(c_j) = \frac{2 P_{c_j} R_{c_j}}{P_{c_j} + R_{c_j}}, \qquad P_{c_j} = \frac{tp_{c_j}}{tp_{c_j} + fp_{c_j}}, \qquad R_{c_j} = \frac{tp_{c_j}}{tp_{c_j} + fn_{c_j}},$

where $tp_{c_j}$, $fp_{c_j}$, and $fn_{c_j}$ are the counts of true positives, false positives, and false negatives, respectively, for class $c_j$.
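The first three metrics can be sketched in a few lines of Python; class configurations are represented as tuples, and `probs[l]` is a hypothetical mapping from each joint class configuration to its predicted probability:

```python
def global_accuracy(pred, actual):
    """All-or-nothing accuracy: the whole class vector must match."""
    return sum(p == a for p, a in zip(pred, actual)) / len(actual)

def mean_accuracy(pred, actual):
    """Average of the per-class-variable accuracies."""
    d = len(actual[0])
    per_var = [
        sum(p[y] == a[y] for p, a in zip(pred, actual)) / len(actual)
        for y in range(d)
    ]
    return sum(per_var) / d

def global_brier(probs, actual):
    """Squared error between the predicted distribution over joint class
    configurations and the 0/1 indicator of the true configuration."""
    n = len(actual)
    return sum(
        sum((p - (cfg == actual[l])) ** 2 for cfg, p in probs[l].items())
        for l in range(n)
    ) / n
```

A perfectly calibrated, fully confident classifier attains a global Brier score of 0.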
Traditional equations for precision, recall and, therefore, the $F_1$ score can only be used for a unique binary class variable. However, Gil-Begue et al. 53 extended them to multiple, possibly nonbinary, class variables. Let $B$ be a function that computes any of these evaluation metrics from a confusion matrix, and let $M_y$ denote the confusion matrix of class variable $C_y$. The metric scores are then obtained with macro- and micro-averaging as follows:

• Macro-averaging: averages the scores of each class variable:

$B_{\text{macro}} = \frac{1}{d} \sum_{y=1}^{d} B(M_y).$

• Micro-averaging: applies the metric to the pooled confusion matrix of all class variables:

$B_{\text{micro}} = B\left( \sum_{y=1}^{d} M_y \right).$

Only the macro-averaging approach will be considered in our experiments to calculate the $F_1$ score, as micro-averaging is equivalent to the mean accuracy when the cardinality of all class variables is the same and greater than two. When a multiclass class variable $C_y$ is considered, a false positive for a certain class is, at the same time, a false negative for another, that is,

$\sum_{c_j \in \Omega_{C_y}} fp_{c_j} = \sum_{c_j \in \Omega_{C_y}} fn_{c_j},$

so the micro-averaged precision and recall for $C_y$ are equal and, therefore, equal to the micro-averaged $F_1$ score. Given that the micro-averaged precision and the accuracy for $C_y$ can be computed as

$P_{C_y}^{\text{micro}} = \frac{\sum_{c_j \in \Omega_{C_y}} tp_{c_j}}{\sum_{c_j \in \Omega_{C_y}} (tp_{c_j} + fp_{c_j})}, \qquad \mathrm{Acc}_y = \frac{\sum_{c_j \in \Omega_{C_y}} tp_{c_j}}{N},$

we can infer that the mean accuracy of multiclass class variables with the same cardinality is equivalent to the micro-averaged $F_1$ score. Note that the normalization factors $1 / |\Omega_{C_y}|$ cancel each other out since the cardinality is assumed to be the same for each class variable; otherwise, this equality would not hold.
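This equivalence between the micro-averaged F1 score and the mean accuracy can be checked numerically. The sketch below pools true positives, false positives, and false negatives over every label of every (same-cardinality, multiclass) class variable:

```python
def micro_f1(pred, actual):
    """Micro-averaged F1 over multiple class variables: pool tp/fp/fn
    across every label of every class variable before computing P and R."""
    tp = fp = fn = 0
    d = len(actual[0])
    labels = [set(a[y] for a in actual) | set(p[y] for p in pred)
              for y in range(d)]
    for y in range(d):
        for c in labels[y]:
            for p, a in zip(pred, actual):
                tp += p[y] == c and a[y] == c
                fp += p[y] == c and a[y] != c
                fn += p[y] != c and a[y] == c
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With two 3-class variables, three sequences, and four correct single-variable predictions in total, both the micro-averaged F1 and the mean accuracy come out as 4/6.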

| Synthetic data set
Average results obtained from fivefold cross-validations over the five data sets generated from each structure are shown in Tables 2-4. Each table includes the performance of the models when they are learned with different scores. The mean and standard deviation for each evaluation metric are reported and the best results are written in bold.
Interesting results have been obtained in all experiments except with the CLL score (Table 3). This score did not lead to the expected models, at least in the performed experiments, since most learned structures have an empty bridge subgraph even when no penalization of their complexity is applied. The reason is the imbalance between the likelihood and denominator terms, as the latter tends to be larger (and therefore the score smaller) with the inclusion of dependencies in this subgraph. The CLL score thus favors very sparse or even empty structures. Consequently, the following analysis will focus on the results achieved with the BIC and BDe scores.
As we can see in Tables 2 and 4, the Multi-CTBNC outperforms the independent CTBNCs in all synthetic experiments using both the BIC and BDe scores. As averages are susceptible to outliers, the Wilcoxon signed-rank test was applied over all the results of the 50 data sets to verify that there are statistical differences in the performance of the Multi-CTBNC and the CTBNCs. This results in p-values smaller than 0.001 for both scores in all evaluation metrics. Therefore, the differences are highly significant and the null hypothesis that both methods perform equally well is rejected in favor of our new model. The Multi-CTBNC has an important advantage in those contexts where class variables have a very weak or nonexistent relationship with the features that we were able to collect. This can be seen in some class variables from Figures 3A,B,E,F. The reason for this is that the Multi-CTBNC can model the dependencies of those class variables with others, while the CTBNCs may only rely on their prior probabilities.
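A paired Wilcoxon signed-rank test of the kind applied here can be sketched with the normal approximation (no tie or zero-difference correction, so this is only an illustration; a standard statistical package would be used in practice):

```python
import math

def wilcoxon_signed_rank(xs, ys):
    """One-sided Wilcoxon signed-rank test with the normal approximation.
    H1: xs tends to be larger than ys. Assumes no ties in |difference|."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    ranked = sorted(diffs, key=abs)
    # Sum of ranks (1..n) of the positive differences.
    w_plus = sum(i + 1 for i, d in enumerate(ranked) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))   # one-sided upper-tail p-value
    return w_plus, p

# Hypothetical paired per-dataset accuracies of two models.
ys = [0.70 + 0.005 * i for i in range(20)]          # baseline model
xs = [y + 0.01 * (i + 1) for i, y in enumerate(ys)] # uniformly better model
w, p = wilcoxon_signed_rank(xs, ys)
```

When one model improves on the other in every paired data set, as in this toy example, the one-sided p-value falls well below 0.001.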
The influence of the scores on model performance was also studied, and significant differences were found. If we compare the results of the Multi-CTBNC in Tables 2 and 4, a slight improvement is generally observed with the BDe score. This was verified for all evaluation metrics with the Wilcoxon signed-rank test and a significance level of 0.005. We observed in our experiments that this score tends to be more robust to overfitting and to better reconstruct the original structures. As an example, Figure 4 shows how the BIC creates a slightly denser structure from a data set of Figure 1A, while the BDe recovers a more accurate one.
Results obtained with the data sets from Figure 3H are of special interest, since there are no dependencies among the class variables and a more even result between the independent classifiers and the Multi-CTBNC was expected. However, the Multi-CTBNC significantly improves all metrics while, in most cases, correctly defining no associations between class variables. Knowing the simultaneous dependencies of the features on different class variables allows the Multi-CTBNC to learn more accurate models. For this same reason, even an empty-digraph Multi-CTBNC may obtain better results than independent CTBNCs. Table 5 shows the performance of an empty-digraph Multi-CTBNC on the synthetic data sets, highlighting in bold the results that improve on those of the CTBNCs (Table 4). Significant improvements were found in all evaluation metrics, except the macro-averaged F1 score, with the Wilcoxon signed-rank test and a significance level of 0.001. This justifies the use of Multi-CTBNCs even when it is clear that there are no dependencies between the class variables.
Finally, Table 6 shows the learning times for the Multi-CTBNC and the CTBNCs when they are built with the BIC and BDe scores. The learning of a Multi-CTBNC is considerably faster than that of multiple CTBNCs on the proposed data sets. Additionally, the BIC is significantly more time-consuming, since it tends to build denser structures (see Figure 4). These differences are statistically significant.

TABLE 5 Estimated evaluation metrics (mean ± std. deviation) over the synthetic data sets when learning an empty-digraph Multi-CTBNC with the BDe score

| Energy data set
Due to the high cardinality of the feature variables of the energy data set, their nodes are limited to having at most one feature as a parent. Therefore, this section compares the performance of independent max1 CTBNCs with that of a DAG-max1 Multi-CTBNC. Fifteen data sets with different sequence orders were extracted from the original energy data set, and a fivefold cross-validation was performed on each of them. The average results of the cross-validations are shown in Table 7, where the DAG-max1 Multi-CTBNC outperforms the independent classifiers in all evaluation metrics. On this occasion, the most interesting models were obtained with the LL penalized with the BIC, unlike with the synthetic data sets. This is because the BDe defines empty bridge subgraphs, so the classification is performed without considering the feature variables. Wilcoxon signed-rank tests show significant differences for all evaluation metrics with a significance level of 0.001.
Figure 5 shows an example of a Multi-CTBNC structure learned on the energy data set. Unsurprisingly, all class variables have the same six features as children, since they all represent similar motors. The remaining energy features are not shown, as the model does not consider them for classification. The differences between the class variables lie in their influence on the total consumption of the machine and in their relationships with each other. Regarding the latter, the results are very accurate, since most dependencies match the description of Section 7.1.2. In this problem, the differential factor that makes the Multi-CTBNC perform better than independent CTBNCs is its ability to model these dependencies. Expected relationships among feature variables are also detected, since they involve measurements over the three different phases of the electrical system. The reason for this is that the current and voltage are balanced between the phases, with the same magnitudes and a 120-degree displacement. Nonetheless, this balance could be affected, in terms of magnitude and/or phase angular displacement, by the type of electrical consumers (e.g., single-phase elements) or by anomalous behaviors (e.g., failure of electronic components such as a phase diode, fuse, or isolation). An advantage of using a graphical model is that such machine malfunctions could easily be detected by analyzing the learned structure.

| CONCLUSIONS AND FUTURE RESEARCH
This article introduces the Multi-CTBNC, a new probabilistic graphical model that is able to capture the dependencies of multivariate temporal sequences and perform multidimensional classification over them. The learning of its parameters and structure from data is discussed, and its usefulness is justified by solving a novel real-world engineering problem.
Score-based learning algorithms for Multi-CTBNCs were studied, as it is the common approach for learning CTBNs. However, a constraint-based algorithm for CTBNs has been recently proposed by Bregoli et al. 54 where it was experimentally shown to perform better in certain scenarios. Therefore, its extension to learn classification models and, more specifically, the use of its fundamentals to develop a constraint-based algorithm for Multi-CTBNCs are possible lines of research.
The adaptation of Multi-CTBNCs to handle other types of predictive problems, where data and sources have different characteristics, is currently being studied. First, an important limitation of the presented model is that the feature variables are assumed to be discrete, while the problem of discretizing time series still requires more attention. 39 Nonetheless, we would like to study a possible adaptation of Multi-CTBNCs (and therefore CTBNs) to continuous variables, thus avoiding this discretization. Second, this study assumes no missing data or hidden variables, that is, the complete data assumption, which in practice does not always hold. Third, the classifier can easily use other distributions to model the waiting times of variables in a certain state. This is the case of hypoexponential distributions, previously used with CTBNs, 55 which are more appropriate than the exponential for certain applications.

FIGURE 5 Structure of a DAG-max1 Multi-CTBNC learned from the energy data set. CTBNC, continuous time Bayesian network classifier; DAG, directed acyclic graph
The study of structure constraints for CTBNs has not received much attention and, to the best of our knowledge, only the learning of naive Bayes structures and of structures where nodes have a maximum number of parents has been considered. It could be interesting to analyze the possible benefits, in terms of computational complexity and performance, of learning CTBNs with other constraints, such as being trees or DAGs.
Finally, although a rather limited number of class variables is commonly assumed, multiple applications stand out for having a large number of them. For example, the industrial problem introduced in this article could require classifying many more motors. Therefore, we plan to study the inclusion of feature selection approaches in the Multi-CTBNC to make classification under these conditions less prohibitive. This would also allow imposing restrictions on the complexity of the bridge subgraph. With a similar objective, it would be interesting to design class-bridge decomposable 47 Multi-CTBNCs to reduce the computational cost of estimating the most probable class configuration of a sequence.