Accurate Multi-Scale Feature Fusion CNN for Time Series Classification in Smart Factory

Time series classification (TSC) has attracted considerable attention in the machine learning and data mining communities and has many successful applications, such as fault detection and product identification, in the process of building a smart factory. However, achieving both efficiency and accuracy remains challenging because time series are complex and often multi-dimensional. This paper presents a new approach for time series classification based on convolutional neural networks (CNN). The proposed method contains three parts: short-time gap feature extraction, multi-scale local feature learning, and global feature learning. In short-time gap feature extraction, large kernel filters extract features within a short time gap from the raw time series. Then, a multi-scale feature extraction technique is applied in multi-scale local feature learning to obtain detailed representations. A global convolution operation with a large stride then yields a robust, global feature representation. The comprehensive features used for classification are a fusion of the short-time gap, multi-scale local, and global feature representations. To test the proposed method, named multi-scale feature fusion convolutional neural network (MSFFCNN), we designed and trained MSFFCNN on public sensor, device, and simulated control time series data sets. The comparative studies indicate that MSFFCNN outperforms the alternatives, and we also provide a detailed analysis of the proposed MSFFCNN.


Introduction
Time series classification (TSC) is one of the critical factors for implementing smart factories in Industry 4.0, because global production generates many time series every day and everywhere, such as vibration signals and all kinds of sensor data (humidity sensor data, speed sensor data, etc.). All of these data reflect the status of the machine or its surroundings. By predicting the future status of machines and surroundings, decision-makers can make reasonable adjustments in advance to avoid failure and downtime. This enables companies and factories to increase production efficiency and reduce production costs and, more importantly, helps ensure personal safety. Developing an accurate approach for TSC is the key to achieving this. TSC methods can be divided into distance-based methods, feature-based methods [Xing, Pei and Keogh (2010); Zheng, Liu, Chen et al. (2014); Cui, Chen and Chen (2016)], and machine learning-based methods. Distance-based methods such as k-nearest neighbor (KNN) and support vector machine (SVM) can be applied directly to raw time series by calculating the similarity or dissimilarity between two time sequences. Similarity is measured by a distance function, such as the Manhattan distance, Euclidean distance, or maximum distance. However, both KNN and SVM [Chen, Xu, Zuo et al. (2019)] require series of equal length and are sensitive to the dimension of the time series. To overcome these shortcomings, Batista et al. [Batista, Wang and Keogh (2011)] used Dynamic Time Warping (DTW) to perform time series classification. Additionally, combining KNN with DTW can effectively improve prediction accuracy [Chotirat and Eamonn (2005)]. The remaining problem is that DTW requires too much time and too many computing resources [Zheng, Liu, Chen et al. (2014)].
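To make the distance-based family concrete, the following is a minimal sketch of DTW combined with a 1-nearest-neighbor rule. It is a textbook formulation with an absolute-difference local cost and no warping window, not the implementation used in any of the cited works; the function names `dtw_distance` and `nn_dtw_predict` are illustrative.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers.

    Textbook O(len(a) * len(b)) recurrence with |x - y| as the local cost;
    the two sequences may have different lengths.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best alignment cost of a[:i] and b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] repeats against b[j-1]
                cost[i][j - 1],      # b[j-1] repeats against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def nn_dtw_predict(train, query):
    """1-NN rule: return the label of the training series closest to query.

    `train` is a list of (series, label) pairs.
    """
    _, label = min(train, key=lambda pair: dtw_distance(pair[0], query))
    return label
```

The nested loops illustrate why the cost grows with the product of the two series lengths, which is the time and resource concern raised above.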
The key idea of feature-based methods for TSC is to capture the most representative features of the raw time series. Statistical components such as the mean, standard deviation, maximum value, minimum value, and skewness have been applied as statistical-domain features for TSC [Lei and Wu (2020)]. Meanwhile, lower-dimensional reshaped time series have been employed as feature representations for TSC. Principal component analysis (PCA) was employed for TSC [Li (2015); Cao, Tian and Bai (2015)] because of its excellent dimensionality reduction capacity. To obtain rich and robust feature representations, transformation-based methods such as the Fourier transform (FT), fast Fourier transform (FFT), and wavelet transform (WT) have been proposed for time series classification [Hendrik (2008); Zhang, Ho, Lin et al. (2006)]. They transform the raw time series from the time domain into the frequency domain to find strong and novel feature expressions for accurately classifying time series. The useful information of the transformed signal is highly concentrated in the low-frequency part, whereas the noise lies in the high-frequency part. These transformations include discrete and continuous transforms (DT and CT); generally, CT requires more computing resources and time than DT. After feature selection, a classifier such as logistic regression or SVM is applied to classify the time series. Recently, another feature-based method, the shapelet, has proven powerful for TSC [Arathi and Govardhan (2015); Hills, Lines, Baranauskas et al. (2014); Ahn and Lee (2018)] and has become popular. The above methods use only a single model for TSC and may therefore lose some critical information. For this reason, ensemble approaches that combine multiple classifiers have been proposed for time series classification and have achieved excellent performance [Wang, Yan and Oates (2017)]. For instance, the Elastic Ensemble (PROP) combines 11 classifiers using elastic distance measures in a weighted ensemble scheme, and the flat collective of transform-based ensembles (COTE) combines 35 different classifiers by extracting features from the time and frequency domains [Bagnall, Lines, Hills et al. (2015)]. However, all of these methods need handcrafted feature selection and are time-consuming on massive data.
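As an illustration of the handcrafted features discussed above, the sketch below computes the statistical-domain features named in the text and the magnitudes of the first few discrete Fourier coefficients. It is a generic example rather than the procedure of any cited paper; the function names, the use of the population standard deviation, and the naive O(n^2) transform are assumptions made for brevity.

```python
import cmath
import statistics


def statistical_features(series):
    """Mean, standard deviation, max, min, and skewness of one time series."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)  # population standard deviation (assumed)
    if std == 0:
        skewness = 0.0
    else:
        skewness = sum(((x - mean) / std) ** 3 for x in series) / len(series)
    return {
        "mean": mean,
        "std": std,
        "max": max(series),
        "min": min(series),
        "skewness": skewness,
    }


def dft_magnitudes(series, n_coeffs):
    """Magnitudes of the first n_coeffs discrete Fourier coefficients.

    Keeping only low-order coefficients retains the low-frequency part of
    the signal, where the useful information is said to concentrate.
    """
    n = len(series)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(series)))
        for k in range(n_coeffs)
    ]
```

A feature vector built this way would then be passed to a separate classifier, which is the two-stage design that the CNN-based methods discussed next try to avoid.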
Recently, machine learning-based methods for TSC have attracted considerable attention. Traditional machine learning methods include SVM, decision tree, and random forest (RF). Although they can implement TSC without feature selection engineering, their performance is still insufficient for practical needs. Fortunately, deep learning-based methods offer a new choice. Notably, the convolutional neural network (CNN) has been a hot topic for TSC due to its black-box nature and strong feature extraction capability. In a CNN, the extracted deep and robust features are fed into the classifier automatically; that is, feature selection and classification are integrated into one single framework for TSC. Several CNN-based methods have already achieved great success in the domain of TSC. For example, Cui et al. [Cui, Chen and Chen (2016)] proposed a multi-scale CNN (MCNN) for TSC and verified its excellent performance on 44 UCR time series data sets. However, MCNN still needs extra transforms to obtain multi-scale feature representations. To avoid extra operations while keeping high performance, Wang et al. [Wang, Yan and Oates (2017)] developed three deep learning-based methods: fully CNN (FCNN), deep multilayer perceptron (MLP), and residual networks (ResNet). They evaluated and analyzed these three methods on the same benchmark data sets as Cui et al. [Cui, Chen and Chen (2016)], and the comparative experiments indicate the superior performance of FCNN. Zhao et al. [Zhao, Lu, Chen et al. (2017)] applied a classic CNN architecture for TSC and tested it on UCR and simulated data sets. The biggest challenge of TSC on the UCR archive [Chen, Keogh, Hu et al. (2016)] is that the training sets are small, whereas most CNN architectures need massive data to train the model. Cui et al. [Cui, Chen and Chen (2016)] proposed a sliding window (SW) data augmentation technique to generate more data. Besides, Guennec et al. [Guennec, Malinowski and Tavenard (2016)] employed a window warping (WW) method for data augmentation and compared it with SW. The above methods focus only on univariate TSC (UTSC). As a matter of fact, time series that occur in real life may be multivariate. To deal with multivariate TSC (MTSC), Zheng et al. [Zheng, Liu, Chen et al. (2014)] proposed a multi-channel deep CNN (MCDCNN). Two channels were adopted to extract features in their paper; each one-source time series is treated as one channel whose features are extracted individually, and the extracted features are combined by one fully connected layer. Liu et al. designed a multivariate CNN (MVCNN) for fault detection on a prognostics and health management (PHM) data set [Liu, Hsaio and Tu (2019)]. In their paper, the multi-source time series is transformed into a three-dimensional (3-D) tensor as the input, and MVCNN with four stages is then adopted to capture rich features for MTSC. Motivated by MCNN [Cui, Chen and Chen (2016)], Jiang et al. [Jiang, He, Yan et al. (2018)] proposed a multi-scale CNN (MSCNN) for fault diagnosis of wind turbine gearboxes, in which the authors adopted three scales of the mean of each time series at different time gaps for feature extraction. However, it was not tested with only a small amount of training data. Yazdanbakhsh et al. [Yazdanbakhsh and Stick (2019)] proposed a dilated convolutional neural network for MTSC and validated its effectiveness on two human activity recognition time series data sets (WISDM v1.1 [Kwapisz, Weiss, Moore et al. (2011)] and WISDM v.2 [Lockhart, Weiss, Xue et al. (2011)]). However, its accuracy is still lower than that of some traditional feature-based methods. The main related works for TSC using CNN-based methods are summarized in Tab. 1. Zhao et al. [Zhao, Lu, Chen et al. (2017)] employed a classical CNN for both of them.
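The sliding window (SW) augmentation mentioned above can be pictured with the short sketch below. It shows generic window slicing, in which every slice would inherit the label of the series it was cut from; it is not necessarily the exact procedure of Cui et al., and the function name and the `stride` parameter are illustrative assumptions.

```python
def sliding_window_slices(series, window, stride=1):
    """Cut one time series into overlapping slices of length `window`.

    Each slice can serve as an extra training sample that carries the
    label of the original series.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [
        series[start:start + window]
        for start in range(0, len(series) - window + 1, stride)
    ]
```

For a series of length 5 and a window of 3, this yields three slices with stride 1 and two slices with stride 2, so a small training set can be enlarged at the cost of feeding the model shorter inputs.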
Another drawback of the CNN-based methods for TSC discussed above is that they still need other transformation or preprocessing operations. Moreover, some of them cannot cope with small amounts of data. Motivated by this, we designed, trained, and evaluated MSFFCNN to deal with these issues without any handcrafted feature engineering. We suspect that these issues arise because existing models cannot mine rich, robust, and detailed key representations of the raw time series. Therefore, our proposed model adopts a cascading structure to capture abundant feature maps. The main contributions of the manuscript are summarized as follows:
• To the best of our knowledge, few studies have focused on using one CNN-based model for both UTSC and MTSC. This paper addresses this issue with MSFFCNN.
• The cascading structure of MSFFCNN is designed in detail, trained, and verified on both univariate and multivariate time series. The comparative studies indicate that our proposed method outperforms other excellent methods without special preprocessing operations.
• The feature learning capacity of the proposed method is analyzed, and the learned inner feature maps are visualized.
The rest of the paper is arranged as follows. Section 2 gives the problem definition of TSC. Section 3 introduces CNN for TSC and describes the proposed framework. Detailed experimental verification is carried out in Section 4. In Section 5, we discuss the effectiveness of the proposed MSFFCNN. Finally, we present the conclusions of this manuscript.

Problem definition
The TSC problem is to predict the label of a time series. According to the dimension of the time series, it can be subdivided into UTSC and MTSC. The univariate time series in a smart factory are mainly vibration signals, which can be expressed as a sequence of real-valued data points at different timestamps, as written in Eq. (1), where $T$ is the number of timestamps and $x_t$ denotes the $t$-th data point of the vibration signal $X$. We give the detailed definitions of the two types of TSC problems as follows.
$X = \{x_1, x_2, \cdots, x_t, \cdots, x_T\}$ (1)
Definition 1: The UTSC problem considers a vibration signal together with its label, which can be formalized as $D = \{(X, y) \mid X \in \mathbb{R}^{T}, y \in \mathbb{N}^{*}\}$, where $D$ is a complete data sample consisting of the vibration signal $X$ and its label $y$. The label $y$ must be a positive integer, and the number of distinct labels depends on how many statuses exist in the real production environment. The whole data set is formalized as $\mathcal{D} = \{D_1, D_2, D_3, \ldots, D_n, \ldots, D_N\}$, where $N$ is the number of samples in the data set.
Definition 2: A multivariate time series is a set of univariate time series with the same timestamps, which can be denoted as Eq. (2), where $m$ is the number of univariate time series.
$M = \{X^{1}, X^{2}, \cdots, X^{m}\}$ (2)
Similarly, the MTSC problem can be formalized as $D = \{(M, y) \mid M \in \mathbb{R}^{m \times T}, y \in \mathbb{N}^{*}\}$. The whole data set can be written as $\mathcal{D} = \{D_1, D_2, D_3, \ldots, D_n, \ldots, D_N\}$. In this paper, we apply UCR data sets for UTSC verification and real-life multivariate time series for MTSC verification.

Methods
This paper proposes a novel MSFFCNN model to solve the TSC problem without any handcrafted feature engineering operations. The following two subsections give the related background on CNN and a detailed description of the proposed MSFFCNN.

CNN
CNN was proposed by LeCun et al. [Lecun, Bengio and Hinton (2015)]. It is a typical feedforward neural network and is mainly employed for image classification and object detection. The standard CNN consists of two critical components: the convolutional layer and the pooling layer. These two layers alternate in the CNN structure to extract rich feature maps within one sparse expression. The convolutional layer has the properties of weight sharing and invariance to transformation and scaling; consequently, it can extract robust feature representations. The convolution process is shown in Eq. (3), where $x_i$ is the $i$-th input data point in the range of input values $S_j$, $y_j$ is the feature map obtained after the convolution operation with filters $w_{ij}$, and $b_j$ is the bias of each feature map.
$y_j = \sum_{i \in S_j} x_i * w_{ij} + b_j$ (3)
After the convolution operation, the convolved data are processed with an activation function, which keeps only part of the data points active and thereby achieves a sparse representation. One of the most popular activation functions is the Rectified Linear Unit (ReLU), which enables a nonlinear expression of the input signals to enhance the representation ability, as shown in Eq. (4).
$f(y_j) = \max(0, y_j)$ (4)
The pooling layer is used to reduce the dimension of the features and speed up the convergence of the network. There are three types of pooling (down-sampling) methods: maximum pooling, minimum pooling, and average pooling. We give the maximum pooling operation in Eq. (5), where $p_j$ is the maximum value among the feature-map values $a_r$ obtained from the preceding layer within the pooling window $R_j$.
$p_j = \max_{r \in R_j} a_r$ (5)
Usually, the convolutional and pooling operations alternate in the CNN model to extract deep, abstract, and global feature expressions [Peng and Marculescu (2015); Chen, Li and Sanchez (2015)]. CNN can handle one-dimensional (1-D) signals [Liu, Yang, Lv et al. (2019)], two-dimensional (2-D) images, and three-dimensional (3-D) videos [Arif, Wang, Fei et al. (2019)] well. For the TSC problem, we mainly apply CNN to deal with 1-D vibration signals.
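The three building blocks described in this subsection (convolution, ReLU, and maximum pooling) can be sketched for a single 1-D filter as follows. This is a plain-Python illustration under assumed conventions: no padding, a non-overlapping pooling window, and the sliding dot product (cross-correlation) that deep learning libraries usually call convolution. The function names are illustrative.

```python
def conv1d(x, w, b=0.0, stride=1):
    """Slide filter w over x without padding and add the bias b.

    Output position i is sum_j x[i + j] * w[j] + b.
    """
    k = len(w)
    return [
        sum(x[i + j] * w[j] for j in range(k)) + b
        for i in range(0, len(x) - k + 1, stride)
    ]


def relu(values):
    """Rectified Linear Unit: negative values become zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, pool_size):
    """Keep the maximum of each non-overlapping window of length pool_size."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]
```

With the input `[1, 2, 3, 4]` and the filter `[-1, 1]`, `conv1d` returns the first differences `[1, 1, 1]`; with the filter `[1, -1]` it returns `[-1, -1, -1]`, which ReLU maps to zeros. This is the sense in which the activation produces a sparse representation.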

Proposed deep model
The architecture of the proposed MSFFCNN for TSC consists of four parts: short-time gap feature learning, multi-scale feature learning, global feature learning and feature fusion, and output, as shown in Fig. 1. For MTSC, we concatenate multiple UTSC cells (marked with a red box in Fig. 1). Unlike the image classification problem, whose input is a two-dimensional (2-D) image, the input of the designed MSFFCNN is a one-dimensional (1-D) time series. It learns features from one raw time series for UTSC, and it combines the features of multiple individual univariate time series for MTSC. These parts are described in more detail in the following subsections, using UTSC as the example to explain the workflow of MSFFCNN.

Figure 1: The architecture of the proposed MSFFCNN for the TSC problem, which consists of four parts: short-time gap feature learning, multi-scale local feature learning, global feature learning and feature fusion, and classification. The convolution layer is denoted as Conv1D, and the max-pooling layer is denoted as Max1D; 1D means that the convolution and max-pooling operations process a one-dimensional (1-D) tensor. The term "Conca" means the concatenation operation. The architecture is shown for classifying a three-source time series using three-scale convolution, denoted as $1 \times 2$, $1 \times 3$, and $1 \times 4$. For the UTSC problem, only the part of the MSFFCNN architecture marked with the red box is used to learn multi-scale and global feature representations.

Short-time gap feature learning
Similar to other CNN-based classification problems, MSFFCNN requires a 1-D tensor as its input. Therefore, the 1-D time series first needs to be transformed into a tensor by a reshape operation, as shown in Eq. (6) for UTSC and Eq. (7) for MTSC. The inputs are a univariate time series $X$ and a multivariate time series $M$, where $M$ contains $m$ univariate time series $X$, as described in Section 2; in Fig. 1, $m$ is three. For instance, the transformation function turns one $X$ of length 100 into a tensor with the shape [1,100].
After the transformation, we apply a wide convolution to capture the features that are most strongly related within short time gaps, which we name short-time gap features. The wide convolution is implemented by one convolution layer with multiple large filters. Specifically, the filter size adopted in this paper is 64, and the short-time feature obtained for each univariate time series can be written as Eq. (8). For MTSC, this sub-step yields one such feature for each univariate time series.

Multi-scale local feature learning
The feature expressions obtained in the preceding step are processed by multi-scale convolution to extract rich local feature representations. Fig. 1 defines three-scale convolution operations, which are implemented through three convolution layers with different filter sizes, each containing multiple filters. The three-scale convolution operations are $1 \times 2$, $1 \times 3$, and $1 \times 4$, respectively. Moreover, we adopt a max-pooling operation to decrease the dimension of the features and speed up convergence. We refer to the combination of one Conv1D layer and one Max1D layer as one component. As shown in Fig. 1, each scale of MSFFCNN contains two such components. We formalize this process for UTSC in Eqs. (9)-(11). The numbers of filters of the convolution layers in the two components are 16 and 32, and the pooling size is 2. After obtaining the multi-scale local feature representations, we concatenate them on the x-axis for the next step. The extracted features therefore keep their local characteristics while being connected with different weights, as written in Eq. (12). For the three-source time series, this sub-step produces three concatenated features, one per source, indexed 1, 2, and 3.
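The multi-scale step can be pictured with the simplified sketch below: one branch per filter size (2, 3, and 4), two Conv1D plus Max1D components per branch with pooling size 2, and concatenation of the branch outputs along the time axis. Several simplifications are assumptions made for illustration and are not taken from the paper: each layer uses a single filter instead of 16 and 32 filters, the weights are fixed averaging weights rather than learned values, and no padding is applied. All function names are hypothetical.

```python
def conv1d_valid(x, w):
    """Single-filter 1-D convolution (sliding dot product) without padding."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def max_pool1d(x, size=2):
    """Non-overlapping max pooling with window length `size`."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def conv_max_component(x, w, pool_size=2):
    """One component: convolution, ReLU activation, then max pooling."""
    activated = [max(0.0, v) for v in conv1d_valid(x, w)]
    return max_pool1d(activated, pool_size)


def multi_scale_features(x, kernel_sizes=(2, 3, 4), components_per_scale=2):
    """Run one branch per kernel size and concatenate the branch outputs."""
    fused = []
    for k in kernel_sizes:
        w = [1.0 / k] * k  # placeholder weights; a trained model learns these
        branch = x
        for _ in range(components_per_scale):
            branch = conv_max_component(branch, w)
        fused.extend(branch)  # concatenation along the time axis
    return fused
```

For an input of length 16, the three branches produce 3, 2, and 1 values under these assumptions, so the fused representation has length 6. Each branch summarizes the same input over a different local span, which is the intent of the multi-scale design.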

Global feature learning and feature fusion
The preceding sub-step extracts rich multi-scale local feature representations. We then use global convolution and max-pooling layers to capture global representations. In this sub-step, the filter size is 4 and the number of filters is 64. The extracted global features for UTSC are defined in Eq. (13). For MTSC, the representations of the multiple time series first need to be concatenated along a common axis; the concatenated representations are then processed by the same global convolution and max-pooling sub-step, as shown in Eq. (14).

Classification
The aforementioned steps capture rich multi-scale and global feature representations, which are fed into two fully connected layers to extract deeper and more abstract representations. Besides, we employ a dropout layer to reduce overfitting and help the network converge. The dropout rate is set to 0.5, so half of the nodes in the layer are randomly deactivated during training. After the fully connected layers, the features are more abstract and representative and are fed into the output layer. The number of output nodes depends on the application, that is, on how many statuses the time series has. The softmax function [Liu, Wen, Yu et al. (2017)] is employed to generate the probability of each class for a time series; the class corresponding to the maximum probability is the predicted label. The proposed MSFFCNN needs only the raw time series for classification, as described in Eq. (15) for UTSC and Eq. (16) for MTSC, where $F(\cdot)$ is the trained deep model.
$\hat{y} = F(X)$ (15)
$\hat{y} = F(M) = F(\{X^{1}, X^{2}, \ldots, X^{m}\})$ (16)
The remaining configuration of the proposed MSFFCNN is as follows: the activation function is ReLU, cross-entropy is the loss function used to update the network, and "Adam" [Kingma and Ba (2014)] is adopted as the optimizer to minimize the loss of MSFFCNN.
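The output stage can be illustrated with the short sketch below, which assumes that the output function is the usual softmax and that dropout is implemented in the common "inverted" form, where kept values are rescaled by 1 / (1 - rate). Both points, as well as the function names, are assumptions for illustration rather than details confirmed by the text.

```python
import math
import random


def softmax(logits):
    """Turn raw output scores into class probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Cross-entropy loss of one sample given its true class index."""
    return -math.log(probs[true_class])


def predict(logits):
    """Predicted label: index of the class with the maximum probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def dropout(values, rate=0.5, rng=random):
    """Inverted dropout for training: zero each value with probability `rate`."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

At a rate of 0.5, each value is either zeroed or doubled during training, so the expected activation stays unchanged; at prediction time, dropout is simply not applied.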

Experiment verification
The platform used in this study runs Ubuntu 16.04 with 23.4 GB of memory and an Intel(R) i7-700 CPU with a processing speed of 3.6 GHz.

UTSC verification

Data introduction
We adopted ten data sets from UCR [Chen, Keogh, Hu et al. (2016)] for UTSC verification, covering both binary and multi-class classification problems; the data sets are described in detail in Tab. 2. As Tab. 2 shows, six data sets are binary classification problems and four are multi-class classification problems. The training cases are used to train the model, and the testing cases are used for verification. The biggest challenge is that very little data is available for training. Additionally, unlike the methods of Cui et al. [Cui, Chen and Chen (2016)], Guennec et al. [Guennec, Malinowski and Tavenard (2016)], and Jiang et al. [Jiang, He, Yan et al. (2018)], we use only the raw time series, without any data augmentation, to train and verify the proposed deep model. As mentioned in Section 3.2.1, we set the filter size of the wide convolution operation to 64 in MSFFCNN, because the shortest time series, in SonyAIBRobot2, has length 65.
MSCNN [Jiang, He, Yan et al. (2018)] has been shown to be more powerful than MCNN, and we implemented MSCNN (2) because it obtained the best performance for wind turbine gearbox diagnosis. The structure of MSCNN (2) is given in Tab. 3. We also give the structure of the classic CNN; its hyperparameters are set to the best values reported in the authors' paper. For FCNN, MLP, and ResNet, we ran the authors' code. The term "raw" denotes the original time series, and "mean (2)" means that an overlapping window with stride 2 is used to generate the mean values of the time series. The format of the convolutional operation is Conv1D( _ , ); the default is 1. The format of the max-pooling operation is Max1D( _ ), and "AveragePool" means the average pooling operation. The term "classes" is the number of time series statuses. We adopted accuracy as the evaluation metric and ran the proposed method 10 times to reduce the impact of randomness; the reported result is the averaged accuracy on each data set.
The results of the other methods are the best values they reported; all results can be found in Tab. 4. "Win" is the number of data sets on which a method achieved the best result, "Average Accuracy" is the accuracy averaged over the ten data sets, and the standard deviation is used to evaluate the stability of each method. Note that we do not give all results for MCNN: because the authors did not provide the full network configuration, we adopted the values they reported.
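The three summary quantities defined above can be computed as in the sketch below. The accuracies in the usage example are made-up placeholders and are not values from Tab. 4; the function name, the choice of the population standard deviation, and the rule that a tie counts as a win for every tied method are assumptions.

```python
import statistics


def summarize(results):
    """Summarize per-data-set accuracies for several methods.

    `results` maps a method name to a list of accuracies, one per data set
    and in the same data set order for every method. Returns a dict mapping
    each method to (wins, average accuracy, standard deviation).
    """
    n_sets = len(next(iter(results.values())))
    best = [max(acc[i] for acc in results.values()) for i in range(n_sets)]
    summary = {}
    for method, acc in results.items():
        wins = sum(1 for i in range(n_sets) if acc[i] == best[i])
        summary[method] = (wins, statistics.fmean(acc), statistics.pstdev(acc))
    return summary


# Placeholder accuracies for two hypothetical methods on three data sets.
example = summarize({
    "method_a": [0.90, 0.80, 1.00],
    "method_b": [0.95, 0.70, 1.00],
})
```

In this placeholder example, both methods record two wins because they tie on the third data set, which shows why the win count is best read together with the average accuracy and the standard deviation.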
The findings indicate that our proposed MSFFCNN won six best ranks and obtained the highest averaged accuracy on the ten data sets. Even though ResNet won the same number of best ranks as MSFFCNN, its averaged accuracy is only 0.9257, which is much lower than the 0.9803 of the proposed method. Moreover, in terms of standard deviation, MSFFCNN outperforms all other methods except MSCNN, to which it is slightly inferior. However, MSFFCNN does not need any preprocessing operations before modelling, whereas MSCNN needs to calculate the mean values of each time series at different levels, which costs considerable time and computing resources. In summary, the proposed MSFFCNN can predict the label of univariate time series accurately and stably, without any preprocessing operations, by directly taking the original time series as input. To quantify the difference between the proposed method and the other leading methods listed in Tab. 4, we compute the p-value of the t-test, as shown in Tab. 5. The results show no statistically significant difference among these methods at the 95% confidence level, because the selected methods are already leading methods for UTSC. Moreover, the proposed method and MSCNN can be placed in the first group for UTSC, because their p-value is near 1 and their average accuracies are very similar, at around 0.98; Zhao's CCNN and ResNet can be placed in the second group, because their p-values are higher than 0.2; MLP and FCNN can be placed in the last group.
Table 3 (structure of MSCNN (2)): three inputs; output: Concatenate(Sub1, Sub2, Sub3)-Dense(100)-Dense(classes). The activation function is "ReLU", the optimizer is "Adam", and the loss function is "cross-entropy".

Feature extraction capacity validation
To explore and validate the feature extraction capacity of MSFFCNN, we first visualized the activated output of the convolutional layers in MSFFCNN, using t-SNE to reduce the dimension of the extracted feature maps, as shown in Fig. 2. The data set adopted is SyntheticControl. From Fig. 2, we can see that the feature representations in (a) are mixed, difficult to identify, inseparable, and non-linear. After one-dimensional convolution with a large filter, the feature expressions become separable, as can be seen in (b). After the multi-scale feature extraction process, the learned feature representations are discriminable and independent, as shown in (c), which satisfies the principle of classification: maximum difference between classes and minimum variation within a class. Additionally, the global feature learning step ensures that more abstract and vivid feature expressions are extracted from the preceding layer. The comprehensive feature representations are shown in (d). The results indicate that the cascading CNN structure of our proposed method can capture most of the useful representations for accurately predicting time series labels. Secondly, we analyzed the inner feature maps to confirm the feature extraction capacity of the proposed method again. The visualization results for one sample from SyntheticControl are shown in Fig. 3. The original time series oscillates over time within the range from -2 to 2, as shown in Fig. 3(a). The short-time feature map is obtained by one wide convolution operation, as shown in Fig. 3(b); its maximum decreases from 2.0 to 1.75 because the original time series contains some noisy data points. Additionally, only positive values occur in the feature map because the values of the time series are converted into RGB representations. Most of Fig. 3(b) is blue, which denotes detailed information. The range of the multi-scale feature maps increases from the 1.75 of the short-time gap feature to 2.5, as can be seen in Fig. 3(c). It is easier to identify each time series through the multi-scale feature map because the values of the feature map increase considerably. After the global feature learning process, the feature map is clear and identifiable, because it contains more large values and fewer white points with low values, as can be seen in Fig. 3(d).

The influence of different scale
The above analysis is based on the three-scale MSFFCNN. Different numbers of scales may influence the classification performance. We compared a two-scale MSFFCNN (MSFFCNN (2)) and a four-scale MSFFCNN (MSFFCNN (4)) to explore the influence of the scale level. The configurations of MSFFCNN (2) and MSFFCNN (4) are the same as that of MSFFCNN (3) given in Fig. 1 except for the scale level. We designed MSFFCNN (2) with 1 × 2 and 1 × 3 convolution operations, and MSFFCNN (4) with 1 × 2, 1 × 3, 1 × 4, and 1 × 5 convolution operations. The results are shown in Tab. 6. The findings show that the accuracy increases as the scale level increases. We performed a significance test using the t-test, which indicates that there is no significant difference among these three methods: the pairwise p-value is 0.962 for MSFFCNN (2) and MSFFCNN (3), and 0.920 for MSFFCNN (4) and MSFFCNN (3). Another notable observation is that the model becomes more stable as the scale level increases, as can be seen from the standard deviation in Tab. 6. It is worth noting that a higher scale level needs more time to train and test. Therefore, we adopted MSFFCNN (3) for UTSC to balance the time cost against high performance.

MTSC verification
The above analysis has confirmed the effectiveness and superiority of MSFFCNN for UTSC. We also designed a concise experiment to verify the effectiveness of MSFFCNN for the MTSC problem, as follows.

Data introduction
We adopted two data sets to validate the effectiveness of MSFFCNN for the MTSC problem: WISDM v1-split and WISDM v.2, the same as in the paper of Yazdanbakhsh et al. [Yazdanbakhsh and Stick (2019)]. WISDM v1-split consists of accelerometer data collected from 36 users during six daily activities: walking, jogging, upstairs, downstairs, sitting, and standing. WISDM v.2 consists of accelerometer data collected from 56 users while walking, jogging, using stairs, sitting, standing, and lying down. In WISDM v1-split, 41279 samples are generated as the training part and 13162 samples as the testing part. WISDM v.2 gives 10396 training samples and 4456 testing samples. A detailed description of the data sets and the generation method can be found in the paper of Yazdanbakhsh et al. [Yazdanbakhsh and Stick (2019)].

Comparative analysis
The evaluation metric for MTSC is the F-1 score of each label. We compared our method with dilated CNN [Yazdanbakhsh and Stick (2019)], MCDCNN (2) [Zheng, Liu, Chen et al. (2014)], one feature-based method [Ravi, Wong, Lo et al. (2017)] named Ravelet, and classic CNN. We implemented multiple classic CNNs for MTSC based on the method of Zhao et al. [Zhao, Lu, Chen et al. (2017)]. The comparison results are summarized in Tabs. 7 and 8. The findings from Tab. 7 indicate that, in terms of averaged accuracy, our proposed method outperforms the other CNN-based methods and is slightly lower than the feature-based method. We performed a t-test to quantify the difference between the proposed method and Ravelet's method. The result indicates no significant difference, because the p-value (0.624) is much higher than 0.05. Moreover, our proposed method does not need any feature selection operation, whereas Ravelet's method does. All of these methods perform well on the WISDM v1-split data, with averaged accuracies higher than 90.0%. The findings from Tab. 8 indicate that our proposed method outperforms the others, with an averaged accuracy of up to 92.8%. In terms of averaged accuracy, it improves considerably on other state-of-the-art methods: by 3.5% compared to classic CNN [Zhao, Lu, Chen et al. (2017)], by 30% compared to MCDCNN (2) [Zheng, Liu, Chen et al. (2014)], and by 4% compared to dilated CNN [Yazdanbakhsh and Stick (2019)]. Moreover, our method accurately predicts all kinds of labels with an accuracy over 93%, except for the activity "Sitting", whereas MCDCNN (2) almost cannot predict the labels "sitting" and "standing". In summary, our proposed method can accurately handle the MTSC problem without any preprocessing or feature selection operations.
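For reference, the per-label F-1 score used as the MTSC metric can be computed as in the sketch below, using the standard definition F1 = 2TP / (2TP + FP + FN). The function name and the convention of returning 0.0 for a label with no true or predicted instances are assumptions, and the activity labels in the usage note are placeholders rather than experimental data.

```python
def f1_per_label(y_true, y_pred):
    """F-1 score of every label appearing in the true or predicted labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores
```

For the true labels `["walk", "walk", "sit", "sit"]` and the predictions `["walk", "sit", "sit", "sit"]`, the score is 2/3 for "walk" and 0.8 for "sit". Reporting the score per label makes weak classes visible, such as "sitting" and "standing" for MCDCNN (2).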

Discussion
We have proposed a novel deep model named MSFFCNN to extract effective and robust feature representations from raw sensor-related time series in order to predict their labels automatically and accurately. Predicting the label of a time series can be expressed as a TSC problem, which includes UTSC and MTSC. The difficulty of UTSC is that there are far fewer training samples than testing samples, as shown in Tab. 2, which requires the model to extract rich and robust feature representations from scarce training samples. Besides, some CNN-based methods still perform worse than feature-based methods. Therefore, we designed a novel CNN-based deep model that captures multi-scale and global fusion features to overcome these issues, as shown in Fig. 1.
We compared our proposed method with other excellent deep learning-based models on ten UCR sensor-related data sets, as shown in Tab. 4. The comparison indicates that the proposed MSFFCNN is the most competitive in terms of averaged accuracy and is also stable compared to the others in terms of standard deviation. To quantify this difference, we calculated the p-value of the t-test; the results confirm that our proposed model belongs to the first group for UTSC, as shown in Tab. 5. As can be seen from Fig. 2, we analyzed the feature learning capacity of the proposed MSFFCNN using t-SNE. The analysis shows that the wide convolution extracts more robust feature maps from the raw time series, that the multi-scale feature learning sub-process learns features at multiple scales, and that the fused features are robust. We also analyzed the inner feature maps to confirm the feature extraction capacity again, as shown in Fig. 3. We designed three models at different scale levels to analyze the impact of the scale level, as shown in Tab. 6. The findings show that the accuracy increases with the scale level and that the model is more stable when more scales are applied. The p-values of the t-test show no significant difference between the models at different scale levels. Therefore, our proposed structure is stable and has good generalization ability. As shown in Tabs. 7 and 8, we designed, trained, and evaluated the proposed MSFFCNN for MTSC. The results indicate that our proposed model achieves state-of-the-art performance for MTSC without any preprocessing or handcrafted feature selection operations.

Conclusion
In conclusion, we have proposed a new, accurate, and stable approach for time series classification based on CNN. Through the feature extraction steps of the proposed MSFFCNN, we obtain multi-scale, global, and robust fusion feature representations. Our experiments show that the proposed method can predict the labels of both univariate and multivariate time series accurately and automatically, without any handcrafted feature engineering, even with little training data. In addition, the proposed framework is very stable and is not sensitive to the scale level.
We have tested the effectiveness and superiority of the proposed method for the time series classification problem. In the future, we will apply the proposed MSFFCNN to the Job Scheduling Problem (JSP) by treating JSP as a classification problem.
Funding Statement: This work was supported by the Technology Innovation Program (20004205, The development of smart collaboration manufacturing innovation service platform in textile industry by producer-buyer B2B connection) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

Conflicts of Interest:
The authors declare no conflicts of interest.