A T-CNN Time Series Classification Method

Abstract: Time series classification is a basic task in the field of streaming data event analysis and data mining. The existing time series classification methods have the problems of low classification accuracy and low efficiency. To solve these problems, this paper proposes a T-CNN time series classification method based on a Gram matrix. Specifically, we perform wavelet threshold denoising on time series to filter normal curve noise, and propose a lossless transformation method based on the Gram matrix, which converts a time series into a time-domain image and retains all the information of the events. Then, we propose an improved CNN time series classification method, which introduces a Toeplitz convolution kernel matrix into the convolution layer calculation. Finally, we introduce a Triplet network to calculate the similarity between events of the same class and events of different classes, and optimize the squared loss function of the CNN. The proposed T-CNN model can accelerate the convergence rate of gradient descent and improve classification accuracy. Experimental results show that, compared with the existing methods, our T-CNN time series classification method has great advantages in efficiency and accuracy.


Introduction
A time series is the result of a set of sequence data observed from a potential process at a given sampling rate over equal periods [1]. In recent years, with the development of computer technology, time series have been applied more and more widely in fields such as disaster monitoring, safety analysis, climate change, and stock funds. How to classify time series [2][3][4] has long been a difficult point in the field of data mining. For example, mine disaster monitoring and early warning systems use microseismic sensors deployed around mines to store real-time data. Mine disaster events usually last from a few seconds to more than 10 seconds. Classifying these events helps to summarize the features of various types of disasters [5], which is of great significance for data analysis and disaster prevention.
The existing time series classification methods mostly use symbolic aggregation approximation (SAX) [6] and convolutional neural networks (CNN) [7][8], but they ignore the time attribute and their classification accuracy is not high. To solve these problems, this paper proposes a T-CNN time series classification method based on the Gram matrix [9]. The proposed method preserves the time attribute and improves the squared loss function [10] of the CNN model in the fully connected layer to enhance classification accuracy and efficiency. The main contributions of this paper are as follows:


(1) Aiming at the loss of time attributes, a transformation method based on the Gram matrix is proposed, which converts time series into time-domain images without loss, and the wavelet threshold denoising method is used to filter out normal background noise [11][12].
(2) In order to increase the network convergence rate, a Toeplitz convolution kernel matrix [13] is introduced into the CNN convolution layer, and the product of Toeplitz matrices is used to replace the traditional convolution operation.
(3) A T-CNN model is proposed, which introduces a Triplet network [14] in the fully connected layer to calculate the difference functions within the same class and between different classes, and optimizes the squared loss function of the CNN model to improve classification accuracy.

Related work
At present, researchers have studied time series classification and obtained some research achievements.
Reference [15] proposed SAX, which divided the series into segments and converted each segment into a corresponding letter for classification. This method used the idea of aggregation to effectively reduce the dimension and increase the classification efficiency. However, it lost a large number of data points in the time series, and its classification accuracy was not high. Reference [16] proposed a trend turning point (TTP) method, which extracted the trend features of the time series itself and required statistical analysis of the extreme points and inflection points of each time series. This method performed well on data sets with specific regularities, but poorly on data sets with large trend fluctuations.
Reference [17] proposed a time domain distance (TDD) method, which reflected the similarity by calculating the Euclidean distance between different sequences. The closer the distance, the higher the similarity. The classification speed of this method was fast, but the classification accuracy was low. Reference [18] proposed an interval classification method, which divided time series into equal-length intervals, calculated the mean value and standard deviation of each interval, and used support vector machines for classification. This method regarded time series as a set of discrete data points and lost the time attribute.
Reference [19] proposed a Shapelet method, which found the Shapelet from the time series of known class labels to reflect the corresponding classification features. This method achieved high classification accuracy, but the time complexity of Shapelet extraction was high and the classification efficiency was low. Reference [20] proposed a classification method based on convolutional neural networks, which used CNNs to effectively classify time series. However, the training process of this method was complicated, and the method did not consider the time attribute.
To solve the above problems, this paper proposes a T-CNN time series classification method based on the Gram matrix, which converts time series into time-domain images without loss and retains time attributes. The proposed T-CNN model based on convolution neural networks inputs the time-domain images into the model, improving the classification accuracy.

Gram matrix conversion of time series
Time series events are not one or a few abnormal perception data points, but a series of discrete data points that meet the threshold range in the time domain, and the same class events have similar features.
Definition 1 (Time series event): a collection of continuous abnormal data initiated by the first outlier in the time series that exceeds the threshold range and lasts for a period of time. It can be represented as E = {e_1, e_2, ..., e_m}, where e_i represents an outlier in the event and each e_i exceeds the given threshold range δ.
According to Definition 1, each time series event has time attributes. Aiming at the lack of time attribute in the existing time series classification methods, we introduce the Gram matrix conversion method to realize the lossless conversion of time series.
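As a minimal sketch of Definition 1, the following extracts events as maximal runs of consecutive values outside a threshold range [low, high]; the two-sided interval form of the threshold is an illustrative assumption:

```python
def extract_events(series, low, high):
    """Definition 1 sketch: an event is a maximal run of consecutive
    outliers, i.e. values falling outside the threshold range [low, high]."""
    events, current = [], []
    for v in series:
        if v < low or v > high:
            current.append(v)          # still inside an event
        elif current:
            events.append(current)     # event ended at the previous point
            current = []
    if current:
        events.append(current)
    return events
```

For example, a series with two separate bursts above the threshold yields two distinct events.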

Normal noise filtering
After Gram matrix conversion, the histogram of the time-domain image presents a normal distribution. The existence of normal noise directly affects the conversion of images in the time domain. Therefore, before the Gram matrix is used to convert the time series, the data is preprocessed, and the wavelet threshold denoising method is used to filter out the normal background noise carried by the data.
The steps of the wavelet threshold denoising method are as follows: (1) Wavelet decomposition of signals. A wavelet is selected to determine the S-level wavelet decomposition, and then the S-level wavelet decomposition calculation is carried out for the signals.
(2) Threshold quantization of high-frequency coefficients in wavelet decomposition. From layer 1 to layer S, a threshold is selected for the high-frequency coefficients of each layer for threshold quantization. The threshold used in this paper is calculated by λ = σ√(2 ln N), where σ = M/0.6745, M denotes the median of the absolute values of the first-level wavelet decomposition coefficients, 0.6745 is the adjustment coefficient of the Gaussian noise standard deviation, and N denotes the length of the signal.
(3) Wavelet reconstruction of signals. According to the low-frequency coefficients of the S layer and the quantized high-frequency coefficients of the first layer to the S-th layer of the wavelet decomposition, the signal is reconstructed by wavelet.
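The three steps above can be sketched with a single-level Haar decomposition; the paper does not fix the wavelet family or the level S, so this is only a minimal illustration using the threshold λ = σ√(2 ln N) with σ = M/0.6745 and soft thresholding assumed:

```python
import numpy as np

def haar_soft_denoise(signal):
    """One-level Haar wavelet threshold denoising sketch (assumes an
    even-length signal; deeper S-level decompositions and db-family
    wavelets follow the same decompose/threshold/reconstruct pattern)."""
    x = np.asarray(signal, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # approximation (low-frequency)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail (high-frequency)
    sigma = np.median(np.abs(d)) / 0.6745  # noise scale from detail median
    lam = sigma * np.sqrt(2 * np.log(len(x)))
    d = np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)  # soft thresholding
    y = np.empty_like(x)                   # inverse Haar transform
    y[0::2] = (a + d) / np.sqrt(2)
    y[1::2] = (a - d) / np.sqrt(2)
    return y
```

A noise-free constant signal passes through unchanged, since all detail coefficients are zero.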
Similar time series events have the same features, but the locations and intensities of the events differ, so the perceived event data are not on the same scale. Therefore, a normalization equation is used to keep the weights of the data feature dimensions on the objective function consistent. The data normalization equation (Equation 3) is x´ = (x − x_min)/(x_max − x_min), where x denotes the data to be normalized, x´ denotes the normalized result, x_min represents the minimum value in the time series data, and x_max represents the maximum value. Through data normalization, all data are scaled to [0,1]. For example, Fig. 1 shows a normalized time series T´ = {t_1´, t_2´, t_3´, …, t_n´} calculated by Equation 3.
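Equation 3 in code form, as a direct transcription:

```python
def min_max_normalize(series):
    """Equation 3: x' = (x - x_min) / (x_max - x_min), scaling into [0, 1]."""
    x_min, x_max = min(series), max(series)
    return [(x - x_min) / (x_max - x_min) for x in series]
```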

Time domain image conversion based on Gram matrix
The time attribute is an important attribute to determine the classification. Due to the different time attributes, different regular distributions are presented in chronological order. Therefore, this paper proposes a time-domain image conversion based on the Gram matrix. This method can retain the time attributes and convert a time series into an N×N two-dimensional matrix without loss.
The Gram matrix G is the matrix composed of the inner products of each pair of vectors, G[i][j] = <a_i, a_j>, where <a_i, a_j> denotes the inner product of two vectors.
The matrix G is positive semidefinite: for any vector Z, Z´GZ = ||Σ_i z_i a_i||² ≥ 0, so Z´GZ is a positive semidefinite quadratic form (and positive definite when the vectors a_i are linearly independent). A positive definite matrix retains the features of the matrix through its eigenvalues. After the time series is converted by the Gram matrix, the time attributes are retained. The result of the conversion into Gram matrix form is G_t[i][j] = <t_i, t_j>, i, j = 1, …, n, where G_t denotes the time series after Gram matrix conversion, <t_i, t_j> is an inner product pair of the time series, and n is the length of the time series. An inner product represents the correlation between two points, and G_t is a real symmetric matrix. Reading the first row from left to right, or the first column from top to bottom, of G_t, the entries record the correlation between the first point and the subsequent points as time increases. Similarly, the second row and second column record the correlations between the second point and the subsequent points. Therefore, from the upper left to the lower right, the G_t matrix represents the correlations between pairs of points arranged in order of increasing time. The time attribute of the time series is thus retained in the Gram matrix.
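A small sketch of the Gram matrix computation and the properties claimed above; the example vectors are arbitrary:

```python
import numpy as np

def gram_matrix(vectors):
    """G[i, j] = <a_i, a_j>. G is real symmetric and positive semidefinite,
    since z^T G z = ||sum_i z_i a_i||^2 >= 0 for any z (positive definite
    when the vectors a_i are linearly independent)."""
    A = np.asarray(vectors, dtype=float)
    return A @ A.T

G = gram_matrix([[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]])
```

Symmetry and non-negative eigenvalues can be checked directly on the result.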
Since the time series T is one-dimensional, and in the rectangular coordinate system, the inner product pair calculation requires two-dimensional information of the abscissa and the ordinate. Therefore, to better preserve the time attribute of the time series, we use the polar coordinate system to calculate the inner product pairs of the time series.
The time series T can be encoded into polar coordinates by Equations 6 and 7: φ_i = arccos(t_i´), t_i´ ∈ [0,1] (Equation 6), and r_i = i/N (Equation 7), where t_i´ denotes the normalized time series value, i is the time stamp in the time series, and N is the length of the time series. For example, for t_i´ = 0.5, Equation 6 gives arccos(0.5) = 60º; for i = 3 and N = 5, Equation 7 gives a radius of 0.6. Thus, (0.6, 60º) is the coding result. As time goes on, the radius of a point becomes larger and larger, gradually moving away from the center of the circle. Polar coding completely preserves the time attribute through the increasing radius, while the numerical change is represented by the angle change. Therefore, as shown in Equation 8, a new Gram matrix can be obtained from the polar coordinate angle relationship between each pair of points in the time series: G_t[i][j] = cos(φ_i + φ_j), so that the values of the G_t matrix encode the pairwise angular relationships and the diagonal of the G_t matrix is arranged in time order.
As the position of the Gram matrix moves from the upper left corner to the lower right corner, the time series values are arranged into the matrix sequentially. In other words, we preserve the time attribute of the time series, and encode the time dimension into the geometric structure of the matrix. Each value of the matrix is equivalent to the pixel of the image, and each time series is converted into a time-domain image by the Gram matrix.
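The polar encoding and the resulting time-domain image can be sketched as follows; the form G[i, j] = cos(φ_i + φ_j) follows the Gramian angular field construction of [9], and the sample series is arbitrary:

```python
import numpy as np

def gramian_angular_field(t_norm):
    """Polar encoding of a normalized series t' in [0, 1]:
    phi_i = arccos(t'_i) (angle), r_i = i/N (radius, preserves time order),
    then the time-domain image G[i, j] = cos(phi_i + phi_j)."""
    t = np.asarray(t_norm, dtype=float)
    n = len(t)
    phi = np.arccos(np.clip(t, 0.0, 1.0))    # angular encoding
    r = np.arange(1, n + 1) / n              # radial encoding
    G = np.cos(phi[:, None] + phi[None, :])  # n x n time-domain image
    return G, phi, r

# the example from the text: t' = 0.5 -> 60 degrees; i = 3, N = 5 -> r = 0.6
G, phi, r = gramian_angular_field([0.1, 0.4, 0.5, 0.8, 1.0])
```

The resulting matrix is symmetric, and each entry plays the role of a pixel in the converted image.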

Time domain image T-CNN classification
As mentioned above, time series are converted to Gram time-domain images, and the Gram time-domain images are used as the input matrix of convolutional neural networks for classification. In order to solve the problems of complex computation and slow training speed of convolutional neural networks, we propose a method based on the Toeplitz matrix product to replace the convolution operation of the convolution layer, and introduce the idea of triplet network into the loss function to improve the efficiency and accuracy of classification.

Convolution based on Toeplitz matrix multiplication
The convolution operation based on the Toeplitz matrix product is shown in Fig. 3, where the dark blue square represents the convolution kernel and the light blue square represents the matrix to be convolved. The convolution kernel is 2×2, the matrix to be convolved is 3×3, and the step size is 1. The traditional convolution is shown in the upper part of Fig. 3: the convolution kernel moves over the matrix to be convolved with a step size of 1, requiring four traversal positions over the matrix. At each position, the convolution kernel and the overlapped part of the matrix are multiplied element-wise and accumulated, and the obtained value is the local convolution result at the corresponding position. Since traditional convolution needs to traverse the whole image, its computational complexity is high. As shown in the lower part of Fig. 3, in the convolution based on the Toeplitz matrix product, each placement of the convolution kernel over the 3×3 matrix is expanded in row order into a 1×9 row vector, and the four row vectors form a large 4×9 matrix H. The matrix to be convolved is then expanded in row order into a 9×1 column vector X. The product of the large matrix H constructed from the convolution kernel and the column vector X constructed from the matrix to be convolved effectively replaces the convolution computation. Specifically, the convolution kernel matrix H is composed of six small matrices: the matrix in the red box, the matrix in the yellow box, and the zero matrices in the two white parts.
In Fig. 3, the matrix in the red box conforms to the definition form of the Toeplitz matrix. Similarly, the matrix in the yellow box and the zero matrix are Toeplitz matrices. Therefore, the convolution kernel matrix constructed is a large Toeplitz matrix composed of several small Toeplitz matrices. The product of the Toeplitz matrix is used to replace the traditional convolution operation. The convolution kernel is directly constructed into the convolution kernel matrix without traversing the image in order of step size, and the product of the two matrices is calculated to reduce the computational complexity.

Toeplitz convolution kernel matrix construction
In order to replace the convolution calculation with Toeplitz matrix multiplication, the convolution kernel matrix H is constructed into the Toeplitz convolution kernel matrix H_t. Given any convolution kernel matrix H = (h_cd) of size C×D, the construction steps of the Toeplitz convolution kernel matrix are as follows: (1) A small Toeplitz matrix is generated from each row of the convolution kernel matrix. Since the size of the convolution kernel matrix is C×D, the convolution kernel matrix H yields C small Toeplitz matrices H_1, …, H_C. (2) The small Toeplitz matrices obtained in Step (1) are assembled into one large block Toeplitz matrix H_t.

Toeplitz matrix convolution
After obtaining the Toeplitz convolution kernel matrix from Section 4.1.1, the traditional convolution can be replaced by the Toeplitz matrix multiplication using Equation 11.
In Equation 11, the convolution is computed as the product of H_t and X_T, where X = [[x_11, x_12], [x_21, x_22]] denotes the matrix to be convolved, H = [[h_11, h_12], [h_21, h_22]] denotes the convolution kernel, H_t is the Toeplitz convolution kernel matrix constructed above, and X_T is the column vector obtained by arranging all the elements of X in row order. Using the full convolution method, the matrix to be convolved is padded with zeros, and the result returns all the data after convolution: the convolution result matrix has M = A+C−1 rows and N = B+D−1 columns. The calculated column vector is then rewritten into a 3×3 matrix according to M = A+C−1 = 3 and N = B+D−1 = 3, which is the same as the result of the direct convolution calculation.
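Under these definitions, the Toeplitz replacement of full convolution can be sketched and checked against direct convolution; the function names are illustrative:

```python
import numpy as np

def toeplitz_conv_matrix(kernel, in_shape):
    """Construct the block Toeplitz convolution kernel matrix H_t so that
    (H_t @ X.ravel()).reshape(M, N) equals the full 2-D convolution of X
    (A x B) with the kernel (C x D), where M = A+C-1, N = B+D-1."""
    C, D = kernel.shape
    A, B = in_shape
    M, N = A + C - 1, B + D - 1
    H = np.zeros((M * N, A * B))
    for i in range(M):                       # output row
        for j in range(N):                   # output column
            for p in range(C):
                for q in range(D):
                    a, b = i - p, j - q      # contributing input position
                    if 0 <= a < A and 0 <= b < B:
                        H[i * N + j, a * B + b] += kernel[p, q]
    return H

def full_conv2d(x, k):
    """Direct zero-padded (full) 2-D convolution, for comparison."""
    (A, B), (C, D) = x.shape, k.shape
    out = np.zeros((A + C - 1, B + D - 1))
    for i in range(A):
        for j in range(B):
            out[i:i + C, j:j + D] += x[i, j] * k
    return out

# Equation 11 example: a 2x2 matrix X and a 2x2 kernel give a 3x3 result
X = np.array([[1.0, 2.0], [3.0, 4.0]])
K = np.array([[5.0, 6.0], [7.0, 8.0]])
H_t = toeplitz_conv_matrix(K, X.shape)
Y = (H_t @ X.ravel()).reshape(3, 3)          # M = N = 2+2-1 = 3
```

Once H_t is built for a given kernel, every new input only costs one matrix-vector product, which is the efficiency argument made below.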
We use the Toeplitz matrix multiplication to effectively replace the convolution operation. In terms of time complexity, the input time-domain image size is A×B, and the convolution kernel size is C×D. The convolution operation requires the convolution kernel to continuously traverse the time domain image and calculate A×B×C×D times multiplication.
When using Toeplitz matrix multiplication, only one matrix multiplication is needed. As seen from Fig. 3, there are a large number of zeros in each row of the matrix that do not need to be calculated; thus, the actual number of multiplications per row is C×D, the number of rows equals the number of positions the convolution kernel traverses, and the total is approximately A×B×C×D multiplications. Therefore, for a single calculation, the computation of the two methods is roughly the same. However, each time a new time-domain image is input, traditional convolution involves a large number of shift operations, which greatly increases the calculation time.
Although it takes some time to construct the Toeplitz matrix, Toeplitz matrix multiplication only needs to construct the corresponding Toeplitz matrix once according to the given convolution kernel, and then can directly perform the matrix multiply calculation on all the input time-domain images to obtain the convolution result. In this way, for the datasets with a large number of sample sets and test sets, the convolution operation time will be greatly reduced.

T-CNN Model classification
When a CNN model is used for classification, its fully connected layer aggregates the extracted features, and a loss function is required. In this paper, the Triplet network is introduced into the loss function, yielding the proposed T-CNN model.
The fully connected layer computes y = f(ωx + b), where ω is the weight of each neuron, b is the bias, and f(·) is the activation function; the parameters are updated by gradient descent with learning rate a. The Triplet difference function is L_T = max(L_1 − L_2 + α, 0), where α represents the minimum margin between the within-class and between-class difference functions (set to 0.1 in this paper). In each reverse iteration, L_T gradually approaches zero. As shown in Fig. 4, when the feature difference function L_1 of images of the same class is greater than the feature difference function L_2 of images of different classes minus the parameter α, L_T is greater than zero, and the model is adjusted in reverse to make L_1 smaller and L_2 larger. In Fig. 4, A and P belong to the same class, while N does not belong to the same class as A and P. Before the adjustment, the distance between A and P is greater than that between A and N, and the difference function L_T is greater than zero; thus, the model parameters need to be adjusted in reverse. After the adjustment, the distance between A and N becomes larger, while the distance between A and P becomes smaller.
According to Equations 15 and 16, in each reverse iteration L_1 makes the feature difference within the same class smaller, while L_2 makes the feature difference between different classes larger. On this basis, a Triplet loss function is proposed as shown in Equation 17: L(ω, b) = R(ω, b) + λL_1 − μL_2, where R(ω, b) denotes the CNN loss function, λ and μ are coefficients greater than zero, L_1 is the feature difference function within the same class, and L_2 is the feature difference function between different classes. The values of λ and μ need to be further determined through experiments. The new residual of each layer then follows from the backpropagation algorithm. The T-CNN model based on the Triplet network adds the within-class and between-class feature difference functions to the cross-entropy loss function, which allows the parameters to extract features with larger differences more quickly during weight adjustment. The partial derivatives of L(ω, b) enter the backpropagation residual calculation to obtain the new parameters ω and b. Each iteration is more inclined toward the direction of gradient descent, which makes the model converge faster and improves classification efficiency.
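The Triplet difference function L_T can be sketched directly; the embedding vectors in the example are illustrative:

```python
import numpy as np

def triplet_difference(f_a, f_p, f_n, alpha=0.1):
    """L_T = max(L1 - L2 + alpha, 0), with L1 the squared feature distance
    within a class (anchor A vs. positive P) and L2 the squared distance
    across classes (anchor A vs. negative N); alpha = 0.1 as in the paper."""
    l1 = float(np.sum((np.asarray(f_a) - np.asarray(f_p)) ** 2))
    l2 = float(np.sum((np.asarray(f_a) - np.asarray(f_n)) ** 2))
    return max(l1 - l2 + alpha, 0.0)
```

When the negative is already farther from the anchor than the positive by at least the margin, L_T is zero and no reverse adjustment is triggered.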

Experiment
The experiments adopt six data sets from UCR2018 and a microquake data set, which contains three classes of time series event waveforms of mine microseismic signals. The data sets include time series of different sizes, lengths, and numbers of categories. The training set accounts for 40% of the total data set, and the model training framework is TensorFlow. The hardware and software environment of the experiment is shown in Table 1. The experiment data sets are shown in Table 2.

Table 2 The experiment data sets.
Dataset            Train   Test    Class   Length
BME                  72     180      3     128
ItalyPowerDemand    438    1096      2      24
SyntheticControl    240     600      6      60
Cricket_X           312     780     12     300
SwedishLeaf         450    1125     15     128
WordSynonyms        362     905     25     270
microquake         4000   10000      3     unequal

In order to prevent local over-convergence, the convolutional neural network structure is equipped with two convolutional layers. Convolutional layer 1 uses a 5×5 convolution kernel with a 3×3 max-pooling window; convolutional layer 2 uses a 3×3 convolution kernel with a 2×2 max-pooling window and is followed by a standard fully connected layer. The activation function of this layer is ReLU, which makes the CNN model converge quickly. The loss function is the Triplet loss function. The output of the T-CNN model for a time-domain image is p = {p_1, p_2, p_3}, where p_1, p_2, p_3 represent the probabilities of the three classes in the data set. class = T(max(p)) is used to decide the class, where T is a threshold function whose threshold is set to 0.8 after repeated experiments. When max(p) is greater than the threshold of T, the class of the time-domain image is output. To make the experimental results more reliable, the experiments use 10-fold cross-validation, and the final result is the average of the 10 runs.
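The class = T(max(p)) decision rule can be sketched as follows; returning None for a below-threshold output is an assumed rejection behavior:

```python
def classify(p, threshold=0.8):
    """class = T(max(p)): output the index of the most probable class only
    when max(p) exceeds the threshold (0.8 in the paper); otherwise reject."""
    m = max(p)
    return p.index(m) if m > threshold else None
```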

Comparison of model iteration times
The T-CNN model adjusts itself to the optimal state through continuous forward propagation, in which the main adjustable parameters are the learning rate and the number of iterations. In this experiment, the learning rate is set to 0.005 and the number of iterations is varied. Fig. 5 shows the change of time series classification accuracy under different numbers of iterations, where the horizontal axis is the number of iterations and the vertical axis is the classification accuracy. From Fig. 5, when the number of iterations is less than 40, the classification accuracy increases quickly; from 40 to 70 iterations, it increases more slowly; and the accuracy is highest at 70 iterations. It can be concluded that increasing the number of iterations within a certain range improves classification accuracy, but exceeding that range causes the model to over-fit.

Comparison of classification accuracy
The accuracy is the proportion of correctly classified samples to the total number of samples. Fig. 6 compares the classification accuracy of seven different classification models. The compared methods are symbolic aggregation approximation (SAX), Shapelet, Dynamic Time Warping (DTW), Collective of Transformation-based Ensembles (COTE), Complexity Invariant Distance (CID), and the CNN classification method. From Fig. 6, it can be seen that because the T-CNN model converts time series into time-domain images using the Gram matrix, it can completely retain the time features of the time series, and its classification accuracy is significantly better than the other methods.

Comparison of classification precision
The precision is the proportion of samples predicted to be one class that truly belong to that class, which is given by P_j = n_jj / Σ_{i=1}^{n_c} n_ij, where n_ij is the number of samples of class i predicted as the j-th class and n_c is the number of sample classes. The average precision over all classes then gives the average classification precision. Fig. 7 shows the comparison of classification precision. From Fig. 7, due to the improved loss function of the T-CNN model, its classification precision is significantly better than the other methods.

Comparison of classification recall
The recall is the proportion of samples of one class that are correctly detected, which is given by R_i = n_ii / Σ_{j=1}^{n_c} n_ij, where n_ij is the number of samples of class i predicted as the j-th class and n_c is the number of sample classes. The average recall over all classes then gives the average classification recall. Fig. 8 shows the comparison of classification recall.
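Both macro-averaged metrics can be computed from a confusion matrix n_ij, as a sketch; the layout (rows: true class i, columns: predicted class j) matches the definitions above:

```python
def macro_precision_recall(confusion):
    """confusion[i][j] = number of samples of true class i predicted as j.
    Macro precision averages per-class precision (column-wise ratios);
    macro recall averages per-class recall (row-wise ratios)."""
    n_c = len(confusion)
    precisions, recalls = [], []
    for k in range(n_c):
        col = sum(confusion[i][k] for i in range(n_c))  # predicted as k
        row = sum(confusion[k][j] for j in range(n_c))  # truly class k
        precisions.append(confusion[k][k] / col if col else 0.0)
        recalls.append(confusion[k][k] / row if row else 0.0)
    return sum(precisions) / n_c, sum(recalls) / n_c
```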