An algorithm to improve the performance of convolutional neural networks for TSC tasks

The convolutional neural network (CNN) is an important network model for time series classification (TSC) tasks. However, real-world data often contain considerable noise. The clarity with which the spatial features of time-series samples are expressed, along with the freeness (outliers) and balance of the data samples, strongly affects how well CNNs handle TSC tasks. This paper proposes a multi-pose marking learning algorithm (MPML), which optimizes the spatial representation of data samples through multiple graphical representations of the time series and then mitigates the freeness and imbalance of data samples by marking poorly trained samples and learning them preferentially. By improving the clarity of the spatial expression of the time series and applying marking learning to the feature data, the algorithm increases the model's robustness to noise and thereby improves the performance of the CNN on TSC tasks. We demonstrate the effectiveness of the algorithm through comparative experiments on 46 datasets from the UCR Time Series Classification Archive.

The convolutional neural network has the advantage of being good at extracting spatial features, which is the main reason it is widely used in various time series classification tasks. A convolutional neural network can also learn the characteristics of time series data directly, which reduces the complexity of feature engineering and the impact of hand-crafted features on the classification results. 6 It uses weight sharing and maintains translation invariance when learning time series, a major improvement over multilayer perceptrons, in which each timestamp of a time series is tied to its own weight. 7 Almost all research on time series classification models in recent years has been inseparable from convolutional neural networks. Reference 8 proposes a strong baseline model for time series classification based on convolutional neural networks, and reference 9 proposes a TSC network model based on LSTM and CNN, which is validated on almost all UCR datasets. Reference 10 reviewed the main baseline models that have emerged in the field of time series classification over the past 5 years, almost every one of which relies on convolutional neural networks. However, researchers mostly focus on optimizing the neural network model itself and ignore problems in the data samples themselves. Time series datasets usually exhibit two problems. First, the spatial characteristics of time series data samples are not clearly expressed, as shown in Figure 1.
When a time series dataset is constructed from a real scene, different choices of sequence length, feature data, and feature preprocessing methods can produce multiple different time series that all describe the same real phenomenon. How, then, to construct the time series with the clearest spatial features is a question that requires extensive validation and testing.
Second, when time series are used to describe real-world scenarios, the collected samples may suffer from high sample freeness (outliers) and unbalanced sample categories. As shown in Figure 2, some samples fall outside the statistical region of the regular samples, and some classes are described by far fewer samples than the popular classes, which often reduces prediction accuracy.
The above problems seriously affect the performance of CNNs on TSC tasks. The unclear expression of the spatial characteristics of time series samples usually requires more complex feature preprocessing to solve. In reference 11, when classifying network traffic data, each data sample is folded into a two-dimensional matrix so that the network traffic data can be treated as a picture; this raises the dimensionality of the data and exploits the strength of convolutional neural networks at processing image data. Reference 12 proposed using Gramian Angular Summation Fields (GASF) and Markov Transition Fields (MTF) to convert time series into images, extracting high-level features and improving the clarity of the spatial representation of the sequences. It is not difficult to see that graphical representation of time series is an important way to address unclear spatial expression. Overly free samples, however, are usually handled by denoising, and unbalanced class distributions can often only be addressed with more training cycles, which may cause more serious overfitting.

FIGURE 1 Different time series describing the same phenomenon
FIGURE 2 Schematic diagram of data samples with high freeness and unbalanced samples
FIGURE 3 Schematic diagram of the multi-pose expression labeling algorithm

Based on the above problems, this paper attempts to remove this bottleneck of the CNN model with a simple algorithm. As shown in Figure 3, we compute multiple pose expressions of a time series by enumeration and represent the series as a set of pictures taken from different angles, thereby improving the clarity of the dataset's spatial expression. The problems of sample freeness and imbalance are then alleviated by marking poorly trained samples and learning them preferentially.
Comparative tests of the algorithm on 46 datasets from the UCR Time Series Classification Archive show that it improves the performance of convolutional neural networks on TSC tasks.
Essentially, our algorithm is an ensemble method. It obtains the best classification result by converting a time series dataset into multiple different graphical representations through multiplicative decomposition and then training a model independently on each representation. For time series datasets with different feature lengths, both the decomposition forms and the number of generated graphs differ. This multi-pose representation helps to mine the spatial structure of the time series. Compared with focal loss, which optimizes for sample imbalance by adjusting label weights, our approach is more general: it does not rely on the specifics of a particular dataset, and the tests on 46 datasets show its general applicability.
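As a sketch of this ensemble idea (illustrative only, with hypothetical helper names, not the authors' code), one model could be trained per graphical representation and the best-scoring representation kept:

```python
# Sketch: train one model per pose and keep the representation that scores
# best. `train_and_score` is a hypothetical stand-in for training a CNN on
# one pose and returning its accuracy.

def best_representation(poses, train_and_score):
    """Return the (H, W) pose with the highest score, and the score itself."""
    scores = {pose: train_and_score(pose) for pose in poses}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy stand-in scorer that favours near-square images.
poses = [(5, 12), (6, 10), (10, 6), (12, 5)]
pose, score = best_representation(poses, lambda hw: -abs(hw[0] - hw[1]))
print(pose, score)
```

The real scorer would be a full train/validate cycle; the toy lambda merely makes the sketch runnable.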
The rest of this paper is structured as follows: Section 2 describes the multi-pose representation of time series data, Section 3 describes the sample-marking algorithm, Section 4 presents the experiments and results, and Section 5 concludes and outlines future work.

Related definitions
Referring to the descriptions of time series in the literature, 11,13 a time series and a time series dataset are defined as follows:

A time series X = [x 0 , x 1 , … , x T−1 ] is an ordered set of real-valued observations about the real world, with sequence length T and x i ∈ R.

A time series dataset D = {(X 0 , y 0 ), (X 1 , y 1 ), … , (X N−1 , y N−1 )}, where X i is the time series feature data and y i is the natural-number label, starting from 0, corresponding to X i .

Multi-pose representation of time series data
The simplest time-series visualization converts a time series of shape (1, T) into shape (1, H, W), the better to exploit the advantages of convolutional neural networks at processing image data. Here 1 is the grayscale channel of the image, T is the length of the time series, H is the height of the data, and W is the width. As shown in Figure 4, we decompose the time series length T by the multiplicative decomposition T = H × W. This decomposition generally has multiple solutions, and we use enumeration to obtain the set of all folded shapes, denoted M = {(H 1 , W 1 ), (H 2 , W 2 ), … , (H m , W m )}. The minimum values of H and W are restricted; otherwise the convolution operation might not be possible. These minimum values depend on the parameter settings of the convolutional neural network; generally they should not be less than 5, because smaller values greatly restrict the design of the network.

FIGURE 4 Schematic diagram of multi-pose representation of time series

In some cases the number of multiplicative decompositions of T is small, and T may not decompose as T = H × W at all. We can then zero-fill the original time series, appending several zeros to the tail of the feature data so that it can be decomposed in various forms; this padding-like operation does not negatively affect classification performance. After the original time series X = [x 0 , x 1 , … , x T−1 ] is image-folded, a two-dimensional vector is generated; X can then be expressed as a matrix whose element x i,j is the jth value of the ith row, so that after flattening, x i,j corresponds to element x i×W+j of the original sequence. Algorithm 1 shows the process of obtaining M.

Algorithm 1. Enumerating the multiplicative decompositions of T
Input: series length T, minimum side length s (s = 5 by default)
Output: the set of shapes M
1. M ← ∅
2. for h = s to ⌊T∕s⌋ do
3.   for w = s to ⌊T∕s⌋ do
4.     if h × w = T then
5.       add (h, w) to M
6.   end
7. end
8. return M

Algorithm 1 describes the enumeration of the multiplicative decompositions of the time series length. The original dataset D raw is represented graphically according to each shape in M, yielding a set of datasets D = {D 1 , D 2 , … , D m }. Each dataset in D captures one facet of the spatial characteristics of the time series, and their combination gives a more comprehensive description of those characteristics, which is very beneficial to TSC. This is like photographing a person from different angles to show her full pose. The scheme is conducive to obtaining the best spatial representation of time series data, and it is simple and intuitive.
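Under the stated assumption of a minimum side length of 5, the enumeration and folding steps could be sketched in Python as follows (an illustrative sketch, not the authors' implementation; the function names are ours):

```python
import numpy as np

def decompositions(T, min_side=5):
    """Enumerate all shapes (H, W) with H * W == T and H, W >= min_side."""
    return [(h, T // h)
            for h in range(min_side, T // min_side + 1)
            if T % h == 0 and T // h >= min_side]

def fold(series, H, W):
    """Zero-pad the series to length H*W and fold it into an H x W image,
    so that element (i, j) of the image is series[i * W + j]."""
    x = np.zeros(H * W)
    x[:len(series)] = series
    return x.reshape(H, W)

T = 60
M = decompositions(T)
print(M)                          # every valid (H, W) pose for length 60
img = fold(np.arange(T), 6, 10)
print(img.shape)                  # (6, 10)
```

A length with no valid decomposition would first be padded to a nearby length that has one, per the zero-fill operation described above.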

MARKING LEARNING ALGORITHM
In traditional machine learning, the problem of overly free data samples is often addressed by denoising to improve classification performance on typical samples. But this is not always correct, because many of these free samples exist objectively, and removing them hurts sample recall. In addition, a large number of datasets suffer from unbalanced sample classes, a problem generally addressed only with more training cycles, which may cause more serious overfitting. To solve these problems, this paper proposes a marking learning algorithm. In each training iteration, the training set is divided into two parts: a training subset and a test subset. After each epoch of training, the current network predicts each sample of the test subset, and the predicted labels are compared with the true labels. The dataset is remarked during this process, and poorly predicted training instances are learned preferentially in the next iteration. This helps to correct previous mistakes; it is a targeted and precise learning method that avoids the overfitting caused by more training cycles while mitigating sample freeness and imbalance.

FIGURE 5 Marking learning algorithm flowchart

NET i (i ∈ [0, Epochs − 1]) represents the state of the network after round i of training, and NET −1 represents the untrained network. After the 0th round of training, NET −1 becomes NET 0 . As shown in Figure 5, at the beginning of each round of training, the marking learning algorithm divides the training set into a test subset and a training subset by random sampling at a split ratio of p ∶ 1 − p (0 < p < 1).
The training subset is used as input to the network model NET i−1 for training; at the same time, the set of marked data samples is also used for model training (when the model state is NET −1 , the marked sample set is empty). After training, NET i−1 becomes NET i , and the test subset is then used as input to NET i for classification prediction. The marking learning algorithm compares the predicted label with the true label of each sample. Poorly predicted data receive 2 marks, which means the sample will be learned preferentially in the next 2 rounds of training. The sample loses 1 mark after each subsequent round, losing all marks after 2 rounds, but it may be marked again in later learning.
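The marking bookkeeping described above could be sketched as follows; the helper names and the fixed random seed are our assumptions, not the authors' implementation:

```python
import random

def split(indices, p=0.8, rng=None):
    """Randomly split sample indices into training / test subsets (p : 1-p)."""
    rng = rng or random.Random(0)   # fixed seed only for reproducibility here
    idx = list(indices)
    rng.shuffle(idx)
    k = int(p * len(idx))
    return idx[:k], idx[k:]

def update_marks(marks, test_idx, y_true, y_pred, n_marks=2):
    """Give n_marks marks to each poorly predicted test sample."""
    for i in test_idx:
        if y_pred[i] != y_true[i]:
            marks[i] = n_marks
    return marks

def consume_marks(marks):
    """Each marked sample loses one mark after a round of training."""
    return {i: m - 1 for i, m in marks.items() if m > 1}

train_idx, test_idx = split(range(10))
y_true, y_pred = [0] * 10, [0] * 10
for i in test_idx:
    y_pred[i] = 1                   # pretend the test samples were misclassified
marks = update_marks({}, test_idx, y_true, y_pred)
print(marks)                        # each misclassified sample carries 2 marks
print(consume_marks(marks))         # one mark left after the next round
```

During training, any index present in `marks` would be appended to the next round's training subset, which is how preferential learning enters the loop.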
Algorithm 2 describes the above training process; unlike the flowchart, the algorithm also includes model validation. Although marking learning avoids overfitting to some extent compared with repeatedly learning all training data, the risk of overfitting remains, so model validation becomes necessary. Lines 7-16 of the algorithm describe the validation process. We use a dedicated validation dataset, so the run method of the network takes 2 parameters, run (D V , D train ). At the end of each training cycle, NET i is used to predict the validation set D V , and the prediction results are evaluated and recorded. This helps us keep the best-trained model and prevent overfitting.

We use accuracy to evaluate the performance of the model. Accuracy measures the closeness of the predicted labels to the true labels:

Accuracy = (TP + TN) ∕ (TP + TN + FP + FN)

Here TP denotes samples with a positive predicted label and a positive true label, FN denotes samples with a negative predicted label and a positive true label, FP denotes samples with a positive predicted label and a negative true label, and TN denotes samples with a negative predicted label and a negative true label.
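The accuracy measure, written out as a small function (this is the standard definition, nothing specific to this paper):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correctly predicted samples among all samples."""
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85
```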

EXPERIMENTS AND RESULTS
In this section, we demonstrate through comparative experiments that the multi-pose marking learning algorithm improves the performance of convolutional neural networks on TSC tasks. Since this is an effectiveness-verification experiment, we did not design the network models elaborately but adopted simple convolutional neural networks.

Experimental setup
This paper uses 46 datasets from the UCR archive to evaluate the effectiveness of the algorithm comprehensively. The maximum feature length of the datasets is 2709, the minimum feature length is 60, and the average feature length is 493; the maximum number of classes is 60, and the average number of classes is 8.5. The detailed data are shown in Table 1.
As with the experimental settings of other baseline algorithms, 8-10,14,15 we use the same experimental architecture for all datasets; the network architectures differ only in the number of output categories. Table 1 shows the statistical description of the 46 datasets used in our study, with the shortest time series feature length being 60 and the longest 2709, with an average length of 492.96. The feature decomposition of Algorithm 1 yields a different set of poses for time series of different lengths. Table 2 shows the architectures of the models used in the experiments. To obtain more reliable conclusions, three different network architectures were used. Panels A, B, and C of Table 2 show the architectures of the convolutional neural networks CNN1, CNN2, and CNN3, respectively. The output of each convolutional layer applies the identity activation function f (x) = x, which passes the feature data directly to the next layer. In front of the output layer there is a fully connected layer; after the linear rectification function (RELU) is applied to its output, the result is classified by the SOFTMAX function, and the classification result is a one-hot vector of length K, where K is the number of classes in the dataset and can differ per dataset. The relevant functions are:

RELU(x) = max(0, x)

SOFTMAX(z) k = exp(z k ) ∕ Σ j exp(z j ), k = 0, … , K − 1
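These activation functions can be written out in NumPy (standard definitions; the demo logits are arbitrary values of ours, not from the paper):

```python
import numpy as np

identity = lambda x: x                 # conv-layer output passes straight through
relu = lambda x: np.maximum(0.0, x)    # linear rectification before the output layer

def softmax(z):
    """K-way classification output; shift by max for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])
probs = softmax(relu(logits))
print(probs.argmax())                  # index of the predicted class
```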

Evaluation and analysis
In this section, we first verify the effectiveness of the multi-pose expression and then the effectiveness of the marking learning algorithm for improving the performance of the CNN model.

Effectiveness analysis of multi-pose
Compared with the single-pose expression used in the literature, 11 multi-pose representation provides more observation angles for the time series data, just as observing a person from different perspectives does; this helps to mine the spatial rules contained in the time series and is easy to understand. As described above, different time series are transformed into multiple poses according to Algorithm 1, and the resulting set of poses differs with the time series length. We consider the poses (A, B) and (B, A) to be completely different: the former denotes the image consisting of A subsequences of feature length B, and the latter the image consisting of B subsequences of feature length A. They are two different viewing angles of the data, and practice confirms this. Table 3 reports the prediction accuracies of four datasets with different poses under different network architectures. The prediction accuracy of the same dataset clearly varies considerably across shapes. For the dataset Beef, the accuracy gap between the best and worst poses reaches 25% with CNN1, and the gap between pose (10, 47) and pose (47, 10) under CNN1 reaches 10%. Poses (A, B) and (B, A) are thus indeed different. If we chose only one shape as input, we would be unlikely to obtain the best result, whereas multi-pose expression provides a way to reach the best prediction performance.
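A quick NumPy check illustrates why (A, B) and (B, A) are genuinely different views of the same series:

```python
import numpy as np

# Folding the same 6-point series into (2, 3) and (3, 2) groups different
# elements into each row; note (3, 2) is not simply the transpose of (2, 3).
x = np.arange(6)
print(x.reshape(2, 3))   # rows: [0 1 2], [3 4 5]
print(x.reshape(3, 2))   # rows: [0 1], [2 3], [4 5]
```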
In summary, we fold the time series and produce multiple graphical representations by enumeration, so that the spatial structure of the feature data is mined from different perspectives and the best representation is found among them. The experiments show that accuracy differs greatly across poses, which illustrates the effectiveness of our algorithm.

Performance analysis of marking learning algorithm
Under the premise of multi-pose representation and the same test environment and network settings as above, we conducted comparative experiments on 46 UCR datasets to examine the performance of the marking learning algorithm (MLA). We ran the three network models CNN1, CNN2, and CNN3 on each dataset for accuracy prediction. Each model was run twice on each dataset, first without MLA and then with it, to verify the effectiveness of the algorithm. Table 4 shows the detailed statistics of the experimental data; the delta column is the difference between the metrics with and without MLA. The statistics in Table 4 show that across the 46 datasets, the average improvement in prediction accuracy with MLA is 5.08% for CNN1, 4.14% for CNN2, and 4.00% for CNN3, for an average improvement of 4.41% across the three models, which shows its significant effectiveness.
As shown in Table 5, when MLA was used with CNN1, 39 of the 46 datasets achieved better prediction accuracy. With CNN2 and CNN3, 35 and 34 of the 46 datasets, respectively, showed improved prediction accuracy under MLA. In the 138 comparison experiments on the 46 datasets, 108 showed performance improvements, with an average accuracy improvement of 4.41%, demonstrating the broad applicability of the algorithm. Table 6 shows the network convergence period before and after using the marking learning algorithm for the top 10 datasets. The results show that the network converges faster with the marking learning algorithm, which indicates that it improves the efficiency of sample learning.

TABLE 3 Prediction accuracy statistics of some datasets under different shapes
It is not difficult to see that the multi-pose expression can better mine the spatial features of time series, and the marking learning algorithm can improve the learning performance of convolutional neural networks. The MPML can improve the performance of convolutional neural networks for TSC tasks.

Comparison of MPML model and random forest algorithm
We have demonstrated the necessity of multi-pose representation and the significant effectiveness of the marking learning algorithm for improving CNN performance. We also compare the prediction results of MPML in Table 4 with those of the random forest and XGBoost models, both of which are recognized as strong classifiers. Table 7 shows the prediction accuracy of the MPML-based CNN model against the random forest and XGBoost models. On the 46 datasets, the average accuracy of the MPML algorithm is 84%, that of the random forest algorithm is 78.83%, and that of XGBoost is 72.51%. The average accuracy of MPML is 5.18 percentage points higher than random forest and 11.49 percentage points higher than XGBoost.

TABLE 4 Performance comparison statistics of CNN before and after using MLA
In conclusion, the multi-pose marking learning algorithm proposed in this paper is remarkably effective and universally applicable and can improve the performance of convolutional neural networks in processing TSC tasks.

CONCLUSION AND OUTLOOK
Aiming at the problems of unclear spatial representation of time series data, high sample freeness, and unbalanced data in TSC tasks, this paper proposes a multi-pose marking learning algorithm to improve the performance of convolutional neural networks on TSC tasks. We evaluate the algorithm thoroughly on 46 datasets, and the experimental results show its significant effect. In essence, the algorithm makes the convolutional neural network more robust to noise in the data when performing TSC tasks.
Some details of the algorithm still need further study, such as the number of marks. The number of marks in this paper defaults to 2; it is a hyperparameter of the algorithm, and its choice may greatly affect performance. In addition, the training dataset is divided into a test subset and a training subset at a ratio of p ∶ 1 − p, and the choice of this split ratio also warrants discussion. Our study also lacks comparative validation against more baseline models, such as those based on Markov transition fields and convolutional neural networks. We will remedy these deficiencies in future work.

CONFLICT OF INTEREST
The authors have no conflict of interest relevant to this article.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available in the UCR Time Series Classification Archive at https:// www.cs.ucr.edu/~eamonn/time_series_data_2018.