Multivariate time series classification through an interpretable representation

Multivariate time series classification is a task of increasing importance due to the proliferation of new problems in various fields (economy, health, energy, transport, crops, etc.) where a large number of information sources are available. Methods that traditionally worked well in univariate settings often fail to achieve the best results when extrapolated directly to multivariate problems. This is mainly due to the inability of these methods to capture the relationships between the different variables that make up a multivariate time series. The multivariate proposals published to date offer competitive results but are hard to interpret. In this paper we propose a time series classification method that uses an alternative representation of time series: a set of descriptive features that takes into account the relationships between the different variables of a multivariate time series. Applying traditional classification algorithms to this representation, we obtain interpretable and competitive results.


Introduction
Nowadays, large amounts of data are generated. Everything is increasingly interconnected, more and more sensors are embedded in the objects around us, and these sensors monitor the behavior of events of interest over time. They generate large volumes of data in the form of multivariate time series (MTS). A key task in the analysis and mining of these data is multivariate time series classification (MTSC), which aims to give an accurate response to a large number of problems, e.g. detecting whether a patient is sick or shows an anomaly in heart behavior [23], determining whether a driver is in optimal condition to drive [22], recognizing human activities [29], or adapting energy production to particular circumstances [20].
The field of MTSC can be divided into two main types of work. Firstly, applied works that seek a better solution for a given problem, offering ad-hoc proposals that consider the peculiarities of that problem [9] [24] [21]. Secondly, proposals that deal with MTS in a general way while taking into account possible interrelations between the different variables available [26] [1] [14] [35]. The proposals in the latter group are usually based on strong theoretical foundations. A relatively large number of proposals for MTSC can be found in the literature [28] [6] [16] [13]. Most of them are oriented towards obtaining increasing levels of accuracy. However, eXplainable Artificial Intelligence (XAI) [11] is a topic enjoying a growing level of interest. Its goal is to build accurate intelligent systems for complex tasks while also paying special attention to their interpretability. The systems built, or the way they make decisions, are required to be easy for human beings to understand. Thus high accuracy is no longer the only objective, and interpretability receives more attention. This also applies to solutions for classification problems. In the field of MTSC, few proposals pay attention to the interpretability of results [15]. Given the complexity of the problem, most proposals focus on obtaining the best results in terms of accuracy. Even the proposals based on shapelets [34] [3], which are interpretable from their univariate origins, have chosen to use the Transformed Shapelets in multivariate environments [8] or proposals that are even less interpretable [17], giving priority to accuracy over interpretability. One possible way to pave the path towards easier-to-understand solutions to MTSC is to express time series in different domains, for instance in terms of descriptive features instead of the raw time-domain values.
In this paper, we present a new MTSC approach based on the representation of time series through a set of features and measures. This approach transforms the original MTSC problem into a traditional classification problem, which makes the whole set of traditional classification algorithms applicable. While focused on raising the interpretability of the classification results, the approach obtains competitive accuracy with respect to the main state-of-the-art techniques.
The remainder of this paper is organized as follows: Section 2 introduces the state of the art in MTSC. Section 3 describes our proposal in depth. Section 4 presents the experimental study conducted, the results obtained, and their interpretability. Finally, Section 5 concludes the paper.

Related work
In the field of MTSC, proposals derived from methods that have demonstrated good behaviour in univariate cases predominate. Some of the first proposals for MTSC were multivariate extensions of the distance-based algorithm 1NN-DTW [32] [19], given its simplicity and good results in univariate environments. These proposals are a good starting point, but they carry over the limitations they had in univariate environments: high computational complexity and low interpretability, since they only indicate how similar the instances are to each other. In addition, in a multivariate environment the first proposals of 1NN-DTW processed each variable of each time series independently, so they were not able to extract information from the relationships between the different variables that make up each multivariate time series. To address this, multiple proposals for a multivariate DTW have been made, such as dependent (DTW_D) and independent (DTW_I) warping, both having the same performance [30]. Other proposals, such as the Mahalanobis Distance-based Dynamic Time Warping measure (MDDTW) [25], seek to give a general answer to this problem. MDDTW is able to precisely calculate the relationship between the different variables that compose an MTS, which, together with the alignment obtained by DTW, yields very competitive results.
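The difference between the two multivariate warping strategies mentioned above can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the implementation evaluated in [30]: the squared difference is assumed as local cost, no warping window is used, and the function names (dtw, dtw_independent, dtw_dependent) are hypothetical.

```python
import numpy as np

def dtw(cost):
    """Classic DTW over a precomputed (len_a, len_b) local-cost matrix."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three admissible predecessor cells.
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return float(acc[n, m])

def dtw_independent(a, b):
    """Independent warping: each variable gets its own path, distances are summed.

    a, b: arrays of shape (length, n_variables); lengths may differ.
    """
    total = 0.0
    for k in range(a.shape[1]):
        cost = (a[:, [k]] - b[:, k][None, :]) ** 2  # (len_a, len_b)
        total += dtw(cost)
    return total

def dtw_dependent(a, b):
    """Dependent warping: a single path shared by all variables."""
    diff = a[:, None, :] - b[None, :, :]            # (len_a, len_b, n_variables)
    return dtw((diff ** 2).sum(axis=2))
```

Because each variable may follow its own optimal path, dtw_independent never exceeds dtw_dependent under this cost, and for a single variable both coincide.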
The feature-based approach has multiple proposals, which give special importance to the extraction of additional information and to processing speed, especially when compared to similarity-based techniques. In this field we can differentiate between proposals based on shapelets and those based on bag-of-words. In the field of shapelets, Generalized Random Shapelets Forests (gRSF) [18] is considered the state-of-the-art, obtaining better results than its direct competitor, Ultra Fast Shapelets (UFS) [33]. gRSF is based on the creation of a set of shapelet-based decision trees from a random extraction of the shapelets. In the field of bag-of-words, Word ExtrAction for time Series cLassification plus Multivariate Unsupervised Symbols and dErivatives (WEASEL+MUSE) [28] is considered the state-of-the-art, as it obtains the best results against its direct competitors: Learned Pattern Similarity (LPS) [7], Autoregressive forests for multivariate time series modelling (mv-ARF) [31], Symbolic representation for Multivariate Time Series classification (SMTS) [6], and gRSF. All of them have been tested on one of the first reference MTS databases, collected from [5], with a total of 20 MTSC datasets. WEASEL+MUSE extracts a vector of features by applying a sliding window to each variable of the MTS and filtering out non-discriminative features. Finally, a classifier analyses these data.
In the field of deep learning, the Long Short Term Memory Fully Convolutional Network (LSTM-FCN) and Attention LSTM-FCN (ALSTM-FCN) have been extended to a multivariate environment [16] by including a squeeze-and-excitation block in the fully convolutional block, which improves accuracy. This proposal improved the WEASEL+MUSE results over the original database of 20 datasets [5] extended with 10 datasets from the UC Irvine Machine Learning Repository (UCI) [12] and 6 datasets used by Pei et al. [27]. A new proposal has recently emerged, Local Cascade Ensemble for Multivariate Data Classification (LCE) and its extension for Multivariate Time Series (LCEM) [13]. LCE and LCEM are hybrid ensemble methods with 2 major objectives. The first one is to handle the bias-variance tradeoff by an explicit boosting-bagging approach. The second one is to individualize classifier errors on different parts of the training data by an implicit divide-and-conquer approach. This proposal is presented as the new state-of-the-art in MTSC, as it obtains better results than the previous state-of-the-art methods MLSTM-FCN and WEASEL+MUSE on the University of East Anglia (UEA) repository [2], a new repository for MTSC composed of 30 datasets that is becoming increasingly important.
In contrast to state-of-the-art methods, we propose a method that obtains essential features of each variable and each MTS and applies a transformation to the MTS dataset, yielding a traditional attribute-based classification problem. All traditional classification algorithms can be applied to this new dataset and, depending on the algorithm applied, we can obtain either interpretable results that explain the problem or results of higher accuracy.
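The transformation from an MTS dataset to an attribute-based dataset can be sketched as follows. This is a minimal illustration and not the authors' code: the three features used here (mean, standard deviation, lag-1 autocorrelation) are simple stand-ins for the feature set of the proposal, and the names FEATURES and to_feature_table are hypothetical.

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 autocorrelation; NaN when the series has no variance."""
    centered = x - x.mean()
    denom = np.dot(centered, centered)
    if denom == 0:
        return float("nan")
    return float(np.dot(centered[:-1], centered[1:]) / denom)

# Illustrative feature set (assumption): any per-series descriptor can be added.
FEATURES = {
    "mean": lambda x: float(np.mean(x)),
    "std": lambda x: float(np.std(x)),
    "acf1": lag1_autocorr,
}

def to_feature_table(dataset):
    """Turn n multivariate series into an n x (m * j) attribute table.

    dataset: list of n series, each a list of m one-dimensional arrays.
    Variables are processed one by one, so lengths may differ between series.
    """
    n_vars = len(dataset[0])
    names = [f"var{v + 1}_{f}" for v in range(n_vars) for f in FEATURES]
    rows = [
        [fn(np.asarray(var, dtype=float)) for var in series for fn in FEATURES.values()]
        for series in dataset
    ]
    return names, np.array(rows)
```

Each row of the resulting table describes one MTS, and each column name records which variable and which feature it came from, which is what keeps a later tree or importance analysis readable.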

Multivariate time series classification through an interpretable representation
In this work we propose a method that allows complexity measures to be calculated for and applied to MTSC problems. Our proposal, namely Complexity Measures and Features for Multivariate Time Series (CMFMTS), is based on the idea that a time series can be faithfully represented with a set of complexity measures and descriptive features [4]. Furthermore, these features preserve most of the information content of the series, to such an extent that they can be used to classify the series.
As an example, consider the calculation of some features on an MTS with three variables. In Table 1, we show some features highly related to the nature of the time series, together with their range of possible values. In Figure 1, we show a simple example of the feature extraction used and its interpretability. First, we can see that variables 1 and 2 are similar, so we can expect their feature values to be similar as well. This is reflected in the values of kurtosis and skewness: if variables 1 and 2 have similar values, their probability distributions will be similar, and therefore so will their kurtosis and skewness. Variable 3 differs significantly in this respect. In all three variables we can see a single trend, and for this reason the trend values are close to 1 in all cases. The oscillations present in variables 1 and 2 seem more typical of seasonal patterns that do not affect the trend of the time series. To evaluate the Chao-Shen Shannon entropy (shannon entropy cs), we have to look at the evolution of variables 1, 2, and 3. Variables 1 and 2 show a certain pattern, while variable 3 shows a long period without changes followed by a final drop to a value never seen before. For this reason, variable 3 obtains a higher Shannon entropy value, very far from those obtained by variables 1 and 2. We also analyze the values of curvature and linearity. Given the evolution and shape of the three variables, and the perceptible linear relationship between the current values of variables 1 and 2 and their corresponding past values, it is logical to expect positive and similar values of curvature and linearity for variables 1 and 2. Variable 3, on the other hand, shows neither these shapes nor a linear relationship between its present and past values, so it obtains negative curvature and linearity values that are far from those obtained by variables 1 and 2. Figure 2 shows the workflow of our proposal.
First, a set of n multivariate time series is assumed, each consisting of m variables (Figure 1.1). Individually, each one of the variables that compose each time series is processed, obtaining the j features for each variable (Figure 1.2). A dataset is obtained with n × m

[Table 1. Column headers: Char. Name, Description, Range]

Although the feature-based approach can be applied to all types of machine learning problems (supervised, unsupervised, semi-supervised, etc.), in the supervised case simple and fast comparisons can be made with respect to the main state-of-the-art algorithms. Due to the great variety of the processed time series, undesired values may be obtained for some of the proposed features. For example, to calculate the autocorrelation coefficient function (ACF) [4] with respect to the values delayed by 10 time instants, the time series must have a minimum length of 11; otherwise, we obtain a Not Available (NA) value. Time series with a single value are another problematic case, since features like kurtosis and skewness are not defined for them and return Not a Number (NaN) values. Time series containing NA values cause internal problems in some of the features used (acf, kurtosis, skewness, shannon entropy cs, etc.), which then return NA values. Finally, some features can take values in the range (−∞, ∞). Extreme values close to these limits are considered undesirable, since they cause problems in the algorithms applied later. To deal with these cases, we have specified a preprocessing stage, which follows the calculation of the features and their correct ordering and resolves the possible inconveniences generated by these cases. The whole process is depicted in Algorithm 1.
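The length requirement described above for the autocorrelation feature can be illustrated with a small guarded function. This is a sketch under assumptions, not the feature implementation used in the paper: the name acf_at_lag is hypothetical, and Python's NaN is used to stand in for both the NA and NaN values mentioned in the text.

```python
import numpy as np

def acf_at_lag(x, lag):
    """Autocorrelation at a given lag, or NaN when it cannot be computed."""
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    x = np.asarray(x, dtype=float)
    # A lag of k needs at least k + 1 observations (e.g. 11 values for lag 10),
    # and a series that already contains missing values propagates them.
    if len(x) <= lag or np.isnan(x).any():
        return float("nan")
    centered = x - x.mean()
    denom = np.dot(centered, centered)
    # A constant series has zero variance, so the coefficient is undefined.
    if denom == 0:
        return float("nan")
    return float(np.dot(centered[:-lag], centered[lag:]) / denom)
```

Returning a single missing-value marker for every failure mode keeps the later preprocessing stage simple, since it only has to look for one kind of undesired value.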
    end if
10: end for
11: for each column in (mvf_train, mvf_test) do
12:     for each value in column do
13:         if is.na(value) then value ← mean(column) end if
14:     end for
15: end for
16: for each column in mvf_train do
17:     if (length(unique(column)) <= 1) then
18:         mvf_train.delete(column.index)
19:         mvf_test.delete(column.index)
20:     end if
21: end for
22: output_data ← NULL
23: for each model in models do
24:     fit ← train.model(mvf_train, train.Ts_class)
25:     pred ← fit.predict(mvf_test)
26:     acc ← accuracy(pred, test.Ts_class)
27:     output_data.add(fit, pred, acc)
28: end for
29: return (output_data, mvf_train, mvf_test)

Algorithm 1 Preprocessing procedure
The starting point is the calculation of the proposed features on the training and test sets (Line 1). For any of the cases mentioned above in which an undesired value has been obtained, these values are unified under a single NA identifier (Lines 2-4). We then check on the training set whether any column lacks interest because it is full of undesired values. If so, this column is removed from both the training set and the test set (Lines 5-10). To simplify the treatment of missing values, we have chosen to impute them with the average of their respective column (Lines 11-15). There are better imputation techniques, but we do not address that task in this paper, and the technique considered has proved to be effective enough. To avoid using variables without information, we analyze the training set looking for variables with a single value. If any such variable is found, it is eliminated from both the training set and the test set (Lines 16-21). Finally, each of the specified models is processed, obtaining the fitted model, its prediction on the test set, and the accuracy achieved (Lines 23-28). These data are returned to the user, together with the datasets transformed to the features of our proposal (Line 29).
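The cleaning steps just described (unifying undesired values, dropping empty columns, mean imputation, and dropping constant columns) can be sketched on plain feature matrices. This is a hedged illustration rather than the authors' R code: only infinite values are treated as extreme here because the text gives no threshold, a test column with no valid value falls back to the training mean (an assumption), and the name preprocess is hypothetical.

```python
import numpy as np

def preprocess(train, test):
    """Clean two feature matrices with the same columns; returns cleaned copies."""
    train = np.array(train, dtype=float)
    test = np.array(test, dtype=float)

    # Unify undesired values (NaN, +inf, -inf) under one missing marker.
    train[~np.isfinite(train)] = np.nan
    test[~np.isfinite(test)] = np.nan

    # Drop columns that contain no usable value in the training set.
    keep = ~np.isnan(train).all(axis=0)
    train, test = train[:, keep], test[:, keep]

    # Impute missing values with the mean of their own column.
    train_means = np.nanmean(train, axis=0)
    for data in (train, test):
        for c in range(data.shape[1]):
            missing = np.isnan(data[:, c])
            if missing.any():
                # Assumption: fall back to the training mean if nothing is left.
                fill = train_means[c] if missing.all() else data[~missing, c].mean()
                data[missing, c] = fill

    # Drop columns that carry a single value in the training set.
    informative = np.array(
        [len(np.unique(train[:, c])) > 1 for c in range(train.shape[1])], dtype=bool
    )
    return train[:, informative], test[:, informative]
```

Both column filters are decided on the training set only and then applied to the test set, so the two matrices always keep the same columns.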
Finally, once we have explained our proposal and its application in a real environment, we can list the main advantages offered by this approach:
• Allows the application of any vector-based classification method, since after the applied transformation we obtain a traditional dataset where each instance is represented by its corresponding attributes (features).
• Allows the use of machine learning methods based on different paradigms (supervised, semi-supervised, self-supervised, unsupervised, etc.), since it produces a vector-based dataset where each instance is composed of different attributes.
• Easily handles datasets of time series with varying lengths, as it processes each time series individually.
• Decisions made can be easily understood by human experts, since the features used explain the behavior of the time series. In addition, the concepts represented by the selected features are interpretable by the users.

Empirical Study
To assess the performance of the proposal in classification tasks, a thorough empirical study has been designed and carried out. We start by explaining the experimental design (Section 4.1), followed by the results obtained (Section 4.2). Finally, we analyze the interpretability of the models obtained (Section 4.3).

Experimental Design
We describe the performance measures used to evaluate our proposal (Section 4.1.1), followed by the datasets used (Section 4.1.2) and the machine learning models selected (Section 4.1.3). Finally, we describe the hardware and software used in the development of our proposal (Section 4.1.4).

Performance measures
Since the datasets come from very different fields, we have opted for a ranking-based performance measure. We have selected the average rank as the comparative method, computed from the accuracy on the original training and test sets. The accuracy has been calculated as the number of correctly classified instances divided by the total number of instances in the test set. We have also included the Win/Loss/Tie ratio to quantify the number of cases in which each model and approach wins, loses, or ties with respect to the best case. Since the range of possible results is wide, we have opted to include a Critical Difference diagram (CD) [10]. The CD shows the results of a pairwise statistical comparison between all models based on average ranks. Models that are connected by a bold line do not have a statistically significant difference at a particular confidence level. In our case, we have set an α of 0.05, for a 95% confidence level. The average rank and the CD were obtained using the R scmamp package.
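The accuracy, average rank, and win/tie counts described above can be sketched as follows. This is an illustrative Python version, not the scmamp-based procedure used in the paper: it assumes an accuracy matrix without missing entries, ties share the mean of the ranks they span, and the function names are hypothetical.

```python
import numpy as np

def accuracy(pred, truth):
    """Correctly classified test instances divided by all test instances."""
    return float(np.mean(np.asarray(pred) == np.asarray(truth)))

def average_ranks(acc):
    """acc: (n_datasets, n_models) accuracies; rank 1 is the most accurate model."""
    acc = np.asarray(acc, dtype=float)
    ranks = np.empty_like(acc)
    for d, row in enumerate(acc):
        for m, value in enumerate(row):
            better = np.sum(row > value)
            tied = np.sum(row == value)      # includes the model itself
            # Tied models share the average of the positions they occupy.
            ranks[d, m] = better + (tied + 1) / 2.0
    return ranks.mean(axis=0)

def wins_and_ties(acc):
    """Per model: datasets won outright, and datasets where the best is shared."""
    acc = np.asarray(acc, dtype=float)
    is_best = acc == acc.max(axis=1, keepdims=True)
    shared = is_best.sum(axis=1, keepdims=True) > 1
    wins = (is_best & ~shared).sum(axis=0)
    ties = (is_best & shared).sum(axis=0)
    return wins, ties
```

For example, with accuracies [[0.9, 0.8, 0.8], [0.7, 0.7, 0.5]] the per-dataset ranks are (1, 2.5, 2.5) and (1.5, 1.5, 3), which gives average ranks of 1.25, 2.0, and 2.75.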

Datasets
To evaluate the performance of our proposal on problems of all kinds, we have selected the main repository of MTSC problems, the UEA multivariate time series classification archive. In Table 2

Models
For our proposal, we have selected a set of traditional models with two main goals: to obtain interpretable results, and to obtain the best classification results at the cost of interpretability [4]. These models are C5.0 with boosting (C5.0B), Random Forest (RF), Support Vector Machine (SVM), and 1-Nearest Neighbor with Euclidean Distance (1NN-ED). For this last model we have applied a normalization to the range [0, 1]. This set of models is applied to the set of time series features obtained by our proposal. The final models of our proposal result from combining the transformed datasets with these four models: CMFMTS+C5.0B, CMFMTS+RF, CMFMTS+SVM, and CMFMTS+1NN-ED. We abbreviate CMFMTS as CMFM in later tables due to space limitations.
On the other hand, we have selected the main state-of-the-art MTSC models:
• 1-Nearest Neighbor classifier with Euclidean distance (1NN-ED), with and without normalization.
• 1-Nearest Neighbor classifier based on the sum of DTW distance for each dimension (DTW-1NN-I) [30], with and without normalization.
• Local Cascade Ensemble for Multivariate data classification (LCEM) [13], with hyperparameters optimized for each dataset (Windows (%), Trees, and Depth). The results have been obtained from the published work of the authors.
• Random Forest for Multivariate (RFM), the Random Forest algorithm from the sklearn library applied to the transformation proposed in the LCEM paper [13].
• Extreme Gradient Boosting for Multivariate (XGBM), the Extreme Gradient Boosting algorithm from the xgboost library applied to the transformation proposed in the LCEM paper [13].
The results of the algorithms mentioned above have been obtained from [13].
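The [0, 1] normalization applied before 1NN-ED matters because the extracted features live on very different scales. The sketch below illustrates this with a plain min-max scaler and a nearest-neighbor rule. It is an assumption-laden illustration: the text does not say where the scaling parameters are estimated, so they are taken from the training set here, and the names minmax_fit and one_nn_predict are hypothetical.

```python
import numpy as np

def minmax_fit(train):
    """Column minima and ranges; constant columns get a range of 1 to avoid 0/0."""
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    span[span == 0] = 1.0
    return lo, span

def one_nn_predict(train, labels, test, normalize=True):
    """Label of the closest training instance under Euclidean distance."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    labels = np.asarray(labels)
    if normalize:
        # Assumption: scaling parameters come from the training set only.
        lo, span = minmax_fit(train)
        train, test = (train - lo) / span, (test - lo) / span
    # Squared distances have the same argmin as Euclidean distances.
    dist = ((test[:, None, :] - train[None, :, :]) ** 2).sum(axis=2)
    return labels[dist.argmin(axis=1)]
```

With training instances [0, 0] (class "a") and [10, 1000] (class "b"), the query [9, 400] is assigned to "a" on the raw values, because the second feature dominates the distance, and to "b" after normalization. Test values outside the training range may fall outside [0, 1] under this scheme.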

Hardware and Software
The experimentation carried out in this work was performed on a server with the following hardware: 4 Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz processors, 8 cores per processor with HyperThreading, 10 TB HDD, 512 GB RAM. The server software configuration comprises Ubuntu 18.04 and R 3.4.4.
The source code of our proposal can be found in the online repository 1 .

Results
We start by analyzing the accuracy and the average rank results on the 30 processed datasets. Table 3 shows the accuracy results obtained by our proposal against the main state-of-the-art algorithms. The NA values refer to cases in which, for some reason (memory overflow, library limitations, etc.), a model could not be obtained correctly and it was impossible to perform the desired classification. As we see in Table 3, our proposal CMFMTS+RF, called CMFM+RF for simplification, obtains the best results among the four models we have proposed: CMFMTS+C5.0B, CMFMTS+RF, CMFMTS+SVM, and CMFMTS+1NN-ED. The CMFMTS+C5.0B model is especially interesting when a simple, easy-to-interpret classifier is required that still offers results close to the best ones, as happens in the CharacterTrajectories, LSST, and PEMS-SF datasets. There are also cases in which CMFMTS+RF does not offer the best results among these four models and it may be worth trying other combinations, as happens in the datasets AtrialFibrillation, EthanolConcentration, FaceDetection, FingerMovements, etc.
If we compare our proposal with the rest of the state-of-the-art algorithms, we see that CMFMTS+RF obtains an average rank of 6.88, close to those obtained by LCEM (4.1), RFM (5.15), MLSTM-FCN (5.3), and WEASEL+MUSE (5.62). We have included two decimals for the average rank so that the differences shown in Figure 2 can be better appreciated. If we observe the Win/Loss/Tie ratio, we can see that MLSTM-FCN obtains the best results in 11 datasets, followed by LCEM, which wins in 9 datasets, and by CMFMTS+RF and RFM, which each obtain the best results in 5 datasets. These behaviors are reflected in the CD diagram shown in Figure 2. This diagram shows that there is no statistically significant difference, for an α of 0.05, between the previously mentioned models, in addition to the DTW-1NN-D model. This indicates that our CMFMTS+RF proposal offers results that are statistically indistinguishable from those obtained by the main state-of-the-art algorithms. Analyzing the median and average values of accuracy, our proposal obtains competitive results. The NA values have been transformed to 0 for the calculation of these last two measurements. Analyzing the results obtained for some specific cases, we can appreciate significant differences between the different proposals. For example, in the DuckDuckGeese dataset, the MLSTM-FCN algorithm obtains 7.5 points of difference with respect to the next best result. LCEM and similar proposals obtain results that differ significantly from the rest of the methods, as can be seen in the HandMovementDirection dataset. In the ERing dataset, we see a big difference between our CMFMTS+Any proposals and the rest of the algorithms. These cases confirm the idea that in the field of MTSC the results are strongly linked to the data itself and the approach used. It is especially complicated to find an approach able to face all kinds of problems with optimal results, or close to them.
The difference in results between the traditional TSC approach and our proposal is largely due to the nature of the underlying problem. In general, it can be seen that the features extracted from the time series summarize behavior at the general level of the time series. Although the classifiers used are capable of finding relationships of interest between the features of the different variables, there are cases in which the class differentiator may fall entirely on specific patterns not reflected in those features. In these cases traditional MTSC approaches such as 1NN-DTW can directly find such patterns.

Analysis and interpretability of results
The interpretability of the final results is strongly related to the model used. For example, the C5.0B model offers a simple decision tree based on the features used, although, as we saw in Table 3, its accuracy results are not the best. An RF, on the other hand, offers competitive results in exchange for sacrificing part of its interpretability, although it is able to assess the importance of each feature in the final model, which can be very useful. In contrast, models such as 1NN-ED lack interpretability, since they work on how much one instance resembles another, and SVMs are really complex to interpret, since weights can be affected by external components unrelated to the underlying importance of each variable. Since tree-based models offer different interpretability tools, we will focus on them in this section.
In Figure 3, we show two examples of single C5.0B trees for the BasicMotions and SelfRegulationSCP1 datasets. BasicMotions is a dataset with 4 classes and MTS with 6 variables. As we see in Figure 3a, our approach allows us to solve this problem with a simple tree composed of 3 nodes. Two of these nodes refer to characteristics of variable 6 and the remaining one refers to variable 1. Although the results obtained for this dataset are not competitive, it is remarkable that acceptable results can be obtained with such a simple decision tree. The dataset SelfRegulationSCP1 is a binary MTS problem with 6 variables. In Figure 3b, we see that 5 nodes have been necessary: 4 refer to variable 1, and the remaining one to variable 2. In this case, the results obtained, although not the best, are close to those of the most competitive models. In both cases, although we have 6 variables per problem, the C5.0B models created focus on using information from only 2 variables. This offers a slight confirmation of a typical phenomenon of MTS, namely that not all variables of an MTS have the same importance.
In the case of the RF we have the Mean Decrease Gini Importance, which indicates the importance of each feature in a specific dataset. This information can be used to improve the learning process and better understand the problem. For this reason, we perform three different analyses that allow us to extract the desired information:

Feature importance by dataset
To analyze in a simple way the importance of the features used, we have chosen a graphic approach. In this way, we also penalize the importance of any features that could not be calculated in any variable. Finally, we normalize these last values between 0 and 1 for each dataset.
In Figure 4, we see significant differences among the different datasets. First, we see datasets with solutions dominated by 1 to 4 features with high importance (BasicMotions, DuckDuckGeese, Heartbeat, InsectWingBeat, NATOPS, PEMS-SF, SelfRegulationSCP1, and SpokenArabicDigits). In these cases, there are two possibilities: either the set of features used is sufficiently expressive to address the problem satisfactorily with competitive results (RF: Heartbeat, InsectWingBeat, and PEMS-SF), or the selected features are not sufficient and other approaches achieve significantly better results (MLSTM-FCN: BasicMotions, DuckDuckGeese, NATOPS, SelfRegulationSCP1, and SpokenArabicDigits).
Secondly, we can identify cases where all the features are necessary (AtrialFibrillation, EthanolConcentration, FingerMovements, HandMovementDirection, Handwriting, LSST, MotorImagery, and PhonemeSpectra). We think there are two possible explanations for this situation. Firstly, in some cases not even the whole set of 41 features provides enough information to produce an accurate classifier. For this reason, the classifier assigns similar importance to a large number of features and is not able to obtain the best results (AtrialFibrillation, EthanolConcentration, FingerMovements, HandMovementDirection, Handwriting, and MotorImagery). Secondly, there are complex cases in which it is not possible to find a reduced subset of features capable of explaining the problem. In these cases, more complex solutions are obtained, with a high number of features, that are capable of offering the best results (LSST and PhonemeSpectra; for this last dataset we obtain results very close to the best ones).
These behaviors and the differences mentioned above can be grasped clearly from Figure 5, where the datasets have been ordered by the accumulated importance of the 41 features. If we look from bottom to top, that is, from highest to lowest accumulated importance, we see that in the 15 datasets from HandMovementDirection to ArticularyWordRecognition our feature-based proposal achieves the best result in only 1 case, LSST. On the other hand, if we look from SelfRegulationSCP1 to Epilepsy, we see that our proposal obtains the best results in 4 out of 15 cases. So we can infer that in cases where great importance is given to a high number of features, this approach does not lead to the best results.

Accumulated feature importance over a large set of datasets
Another particularly interesting analysis concerns importance at the feature level. In Figure 6, we show the average importance of each feature across all the datasets. These values have been obtained from the results shown in Figures 4 and 5. The average value of the importance of each feature has been calculated over the 30 datasets processed. We can see that there is a group of distinguished features, namely curvature, linearity, and shannon entropy cs. This group has importance values far superior to the rest. There is also a second group with high importance values, but far from the highest ones: shannon entropy sg, spectral entropy, trend, unitroot kpss, unitroot pp, x acf1, entropy, e acf1, and spike. All of them are assigned average importance values greater than 0.4, much higher than the rest of the features. Furthermore, it is interesting to note that the features related to the complexity of a time series get the highest importance values. The high values achieved by features such as trend, x acf1, and e acf1 confirm that the components of the time series are very descriptive and useful when extracting information from them. Other features such as curvature, linearity, and spike, which capture characteristic behaviors of the time series, are especially useful in describing them. On the other hand, there are also features with particularly low importance values. In this study, the time series have been treated as traditional data vectors, and for that reason features such as nperiods and seasonal periods have an importance value of 0. When the best results are sought and a detailed analysis of the time series is carried out, in which data on seasonality are available for each time series, these measures can be very useful. In the UEA repository, the vast majority of datasets are composed of MTS of equal length, and for this reason the length feature is not of high importance.
Furthermore, features such as kurtosis and skewness have obtained low average importance values, although they are particularly explanatory. If we look at Figures 4 and 5, we find some datasets, like FaceDetection, CharacterTrajectories, SpokenArabicDigits, and InsectWingBeat, in which these features have obtained high importance values. Thus, even features that are generally not interesting may be relevant for specific problems. These cases reinforce the idea that the selection of a representative set of features must be supported by theoretical knowledge about the structure of time series and by different analyses of results performed on large sets of datasets.

Variable importance
Finally, we analyze an important point in the field of MTSC: the existence of components or variables that contain a greater part of the information on the problem. To do this, we have calculated, for each variable of the problem in question, the sum of the importance values of its 41 features. Then we have rescaled these values by dividing by the maximum value in each case. In this way, the maximum importance value of any variable, in any case, is 1, and the range of possible importance values is [0, 1]. In Table 4, we show the statistics of interest on the values obtained.
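The rescaling just described can be sketched in a few lines. This is an illustration under assumptions rather than the authors' code: the importance vector is assumed to be ordered variable by variable with the same number of features per variable, features removed during preprocessing are assumed to enter with importance 0, and the name variable_importance is hypothetical.

```python
import numpy as np

def variable_importance(feature_importance, n_variables):
    """Accumulated importance per variable, rescaled so the maximum equals 1.

    feature_importance: flat vector of length n_variables * n_features,
    laid out as consecutive blocks of features, one block per variable.
    """
    per_variable = np.asarray(feature_importance, dtype=float).reshape(n_variables, -1)
    totals = per_variable.sum(axis=1)   # sum over the features of each variable
    return totals / totals.max()        # the most important variable gets 1
```

For instance, three variables with two features each and importances [0.1, 0.3, 0.2, 0.0, 0.05, 0.05] accumulate to 0.4, 0.2, and 0.1, which rescale to 1, 0.5, and 0.25.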
If we relate these values to the cases discussed above, we see that, for example, in the PhonemeSpectra dataset all 11 variables are of similar importance and CMFMTS+RF was close to the best results obtained, which means that all variables contain information of interest. In LSST we see that, although the 6 variables contain information of interest, some of them stand out from the rest. To study in a simple way the distribution of the accumulated importance values, we include in Figure 7 the histograms of these values for some cases of interest. For the BasicMotions dataset, in Figure 7a, we see 2 variables with much higher importance than the remainder, together with a third variable that also stands out. The variables are, in decreasing order of importance: 6, 2, 1, 4, 3, and 5. If we compare these importance values given by RF with the C5.0B classification tree shown in Figure 3a, we see that this tree is composed of 3 nodes: 2 of them use features of variable 6 and the remaining node uses a feature of variable 1, which matches the variables given more importance by RF. For the SelfRegulationSCP1 dataset, in Figure 7b, we can see similar behavior: one variable is significantly separated from the rest, while the remaining variables are close to each other in terms of importance. The variables are, in decreasing order of importance: 1, 2, 3, 4, 5, and 6. If we make the same comparison with the C5.0B tree of Figure 3b, we see that the tree is composed of 5 nodes: 4 of them use a feature of variable 1, the most important variable according to RF, while the remaining node uses a feature of variable 2, the second most important variable according to RF. These examples show a certain relation between the variables with more importance according to RF and those used by a simple classifier such as C5.0B. 
NATOPS is a dataset in which only some of the 24 variables contain the most relevant information. In Figure 7c, we can see that it is more difficult to obtain well-differentiated groups of variables according to their importance. Even so, the 3 variables with the greatest importance are significantly distanced from the remainder, with importance values higher than 0.75. Depending on the information sought and the difficulty of the problem, we could decide to lower the threshold, create different groups of variables, etc. These histograms are especially interesting for datasets with a large number of variables. An example is the PEMS-SF dataset, in Figure 7d, where only some of the 963 variables have high importance and the rest have residual importance. In this case, there are only 5 variables with an accumulated importance value higher than 0.75. These variables are, in decreasing order of importance: 212, 187, 55, 172, and 604. In order to evaluate the relative importance of each variable within a problem, we show graphically, in Figure 8, the proportion of importance of each variable for the most representative and easily visualized cases.
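The threshold-based reading of these histograms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rescaled importance values are assumed to be given as a mapping from variable index to a value in [0,1], the helper names top_variables and importance_histogram are hypothetical, and the toy values are not taken from any of the datasets.

```python
def top_variables(scaled, threshold=0.75):
    """Return the variables whose rescaled importance exceeds the threshold,
    in decreasing order of importance."""
    chosen = [variable for variable, score in scaled.items() if score > threshold]
    return sorted(chosen, key=lambda variable: scaled[variable], reverse=True)


def importance_histogram(scaled, bins=4):
    """Count how many variables fall in each of `bins` equal-width
    intervals covering [0, 1]."""
    counts = [0] * bins
    for score in scaled.values():
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(score * bins), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical rescaled importance values for 6 variables.
toy_scaled = {1: 0.2, 2: 1.0, 3: 0.8, 4: 0.1, 5: 0.76, 6: 0.4}

selected = top_variables(toy_scaled)
counts = importance_histogram(toy_scaled)
```

Lowering the threshold argument corresponds to the option, mentioned above, of relaxing the cut-off when the problem or the information sought requires a larger group of variables.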
By looking at Figure 8, we can determine, for selected cases, the variables that have the greatest potential, so that efforts can be directed at improving the information recorded on those variables or at processing them with greater attention. For the ArticularyWordRecognition dataset, we see that variables 1, 4, 7, and 9 are of less importance. The ERing dataset shows great importance accumulated in variables 1 and 4. The Epilepsy dataset has much of its useful information in variables 1 and 2. EthanolConcentration and Handwriting show similar behavior. The HandMovementDirection dataset shows that variables 8, 9, and 10 have greater cumulative importance. In the case of SpokenArabicDigits, the most important variables (1, 2, 3, 4, and 8) can be clearly differentiated from the rest (5, 6, 7, 9, 10, 11, 12, and 13). For the LSST dataset, we see that variables 3, 4, and 5 have higher weight. The JapaneseVowels dataset shows that variables 8, 9, and 13 are of greater importance. With these results, the preprocessing of the data can be modified so as to improve the recording of the data of these variables or to give them greater importance in the learning process.

Conclusion
In this paper we have presented CMFMTS, a method that allows a feature-based approach designed for univariate time series classification problems to be applied to MTSC problems. This method enables the use of traditional classification algorithms on MTSC problems, considerably expanding the tools available to deal with this type of problem. Furthermore, it makes it possible to achieve interpretable results in a field where the solutions obtained are characterized by their complexity, which is directly related to the number of variables of the problem.
We have published the software of our proposal so that it can be used freely. We have also published all the information needed to make our work fully reproducible. Our proposal has been evaluated on 30 datasets from different fields, obtained from the UEA repository. We have focused on tree-based algorithms because of their high interpretability, comparing their results with the main state-of-the-art algorithms for MTSC.
CMFMTS offers very competitive accuracy results in comparison with the main state-of-the-art algorithms. Using the CD diagram, we see that there is no statistically significant difference, for an α of 0.05, between the state-of-the-art algorithms and the best case of our proposal, CMFMTS+RF. We can conclude that these methods work equally well from a statistical point of view.
The interpretability of the results obtained is a significant advantage of our proposal compared to other state-of-the-art methods. CMFMTS allows us to relate representative characteristics of the time series to the classification made, depending on the algorithm used for modeling the problem. The trees obtained by the C5.0B algorithm are a clear example of this. Even algorithms that are less interpretable than classification trees offer all kinds of valuable information. The use of the RF algorithm, with the Mean Decrease Gini Importance as the measure for evaluating the features used, has allowed us to identify which features have a high information potential in each problem, revealing very different behavior between different datasets. This has allowed us to identify the most valuable features for each case, relating each dataset to the behaviors of interest associated with those features.
We have verified the existence of a set of features that maintains high importance throughout different datasets. However, there are also certain cases where other features that are less important on average offer the best results. These features are usually related to characteristic behaviors of the time series. This fact reinforces the idea that the selected features must be supported by a strong theoretical foundation in the field of time series and not be selected only through a purely experimental approach. In addition, the nature of the problem will define which features collect its information in the best possible way.
Finally, based on the accumulated importance per variable of our proposal, we have corroborated that, in a significant part of the processed datasets, not all the variables of an MTS have the same importance when it comes to solving a problem. In these datasets, we have identified between 1 and 4 variables that contained most of the information. This phenomenon is typical of MTS. These results have a direct impact on future classification processes, which can benefit significantly from the additional information generated.