Article

A Boundary Distance-Based Symbolic Aggregate Approximation Method for Time Series Data

1
School of Computer Science, China University of Geosciences (Wuhan), 388 Lumo Road, Wuhan 430074, China
2
Department of Computer Science, University of Idaho, 875 Perimeter Drive MS 1010, Moscow, ID 83844-1010, USA
*
Author to whom correspondence should be addressed.
Algorithms 2020, 13(11), 284; https://doi.org/10.3390/a13110284
Submission received: 11 September 2020 / Revised: 25 October 2020 / Accepted: 26 October 2020 / Published: 9 November 2020
(This article belongs to the Special Issue Algorithms and Applications of Time Series Analysis)

Abstract

A large amount of time series data is generated every day in a wide range of sensor application domains. The symbolic aggregate approximation (SAX) is a well-known time series representation method that discretizes continuous time series and whose distance lower-bounds the Euclidean distance. SAX has been widely used in various domains, such as mobile data management, financial investment, and shape discovery. However, the SAX representation has a limitation: symbols are mapped from the average values of segments, and SAX does not consider the boundary distance within the segments. Different segments with similar average values may be mapped to the same symbols, in which case the SAX distance between them is 0. In this paper, we propose a novel representation named SAX-BD (boundary distance) that integrates the SAX distance with a weighted boundary distance. The experimental results show that SAX-BD significantly outperforms the SAX, ESAX, and SAX-TD representations.

1. Introduction

Time series data are being generated every day in a wide range of application domains [1], such as bioinformatics, finance, engineering, etc. [2]. The parallel explosions of interest in streaming data and data mining of time series [3,4,5,6,7,8,9] have had little intersection. Time series classification methods can be divided into three main categories [10]: feature based, model based and distance based. There are many methods for feature extraction, for example: (1) spectral analysis such as discrete Fourier transform (DFT) [11], (2) discrete wavelet transform (DWT) [12], where features of the frequency domain are considered, and (3) singular value decomposition (SVD) [13], where eigenvalue analysis is carried out in order to find an optimal set of features. The model-based classification methods include auto-regressive models [14,15] or hidden Markov models [16], among others. In distance-based methods, 1-NN [1] has been a widely used method due to its simplicity and good performance.
Until now, almost all research in distance-based classification has focused on defining different distance measures and then exploiting them within 1-NN classifiers. The 1-NN classifier is probably the simplest of all classifiers, yet it performs well. Used as the distance measure of a 1-NN classifier, Dynamic Time Warping (DTW) [17] achieved the best classification accuracy reported at the time. However, the high dimensionality, high volume, high feature correlation, and noise of time series pose great challenges for classification and can even make DTW unusable. In fact, the performance of all non-trivial data mining and indexing algorithms degrades exponentially with dimensionality. For example, above 16–20 dimensions, index structures degrade to sequential scanning [18]. The Piecewise Aggregate Approximation (PAA) [19] and the Symbolic Aggregate Approximation (SAX) [20] were proposed to reduce the dimensionality of time series while keeping a lower bound to the Euclidean distance. Because the distance in the SAX representation lower-bounds the Euclidean distance, the SAX representation speeds up the data mining process of time series data while maintaining the quality of the data mining results. SAX has been widely used in mobile data management [21], financial investment [22], and feature extraction [23]. In recent years, with the popularity of deep learning, applying deep learning methods to multivariate time series classification has also received attention [24].
SAX allows a time series of arbitrary length n to be reduced to a string of arbitrary length w [20] (w < n, typically w << n). The alphabet size α is also an arbitrary integer. The SAX representation has a major limitation: symbols are mapped from the average values of segments, so some important features may be lost. For example, in Figure 1, if w = 6 and α = 6, time series A is represented as ‘decfdb’.
However, Figure 1 clearly shows that the time series changes drastically within its segments. Therefore, a compromise is needed that reduces the dimensionality of the time series while improving accuracy. The ESAX representation can express the characteristics of a time series in more detail [25]. It chooses the maximum, the minimum, and the average value of each segment as new features and then maps these features to symbols according to the SAX method. For the same time series A in Figure 1, the ESAX representation is ‘adfeeffcaefffdaabc’.
The SAX-TD (trend distance) method improves on the accuracy of ESAX and reduces the complexity of the symbolic representation [26]. It uses fewer values than ESAX because each segment needs only one trend distance. In Figure 1, time series A is represented as ‘−1.4d0.13e0.75c0.13f1.25d−0.25b−0.25’.
In this paper, we propose a new method, SAX-BD, in which BD stands for boundary distance. Each time series segment has a maximum point and a minimum point, and we call the distances from these points to the segment average the boundary distance. Time series A and B in Figure 1 have a high probability of being identified as the same if SAX-TD is used. In our method, however, time series A is represented as ‘d(−1.4,0.63)e(−0.38,0.38)c(1.36,−1.39)f(−0.25,0.36)d(1.5,−1)b(−0.38,0.38)’ and time series B is represented as ‘d(−1.4,1.2)e(0.38,−0.5)c(−1.8,1.9)f(0.38,−0.25)d(1.5,−1.0)b(0.45,−0.55)’. The two representations clearly differ.
Our work makes three main contributions. First, we propose an intuitive boundary distance measure on time series segments. The average value of a segment together with its boundary distance helps to measure different trends of time series more accurately. Our representation captures the trends in time series better than the SAX, ESAX, and SAX-TD representations. Second, we discuss the SAX-TD and ESAX algorithms and explain that our method is in fact a generalization of these two methods. On the data sets where they perform poorly, our method improves the results to a certain extent. On the data sets where they perform better, the loss in accuracy of our method stays within a very small range. Third, we prove that our improved distance measure not only keeps a lower bound to the Euclidean distance, but also achieves a tighter lower bound than the original SAX distance.

2. Related Work

Because normalized time series closely follow a Gaussian distribution, we can simply determine the “breakpoints” that produce equal-sized areas under the Gaussian curve. SAX assumes that the average value of each segment falls between two consecutive breakpoints $\beta_i$ and $\beta_{i+1}$ with the same probability $1/\alpha$, and each segment is projected into the region that contains its average. The parameter $w$ determines how far the $n$-dimensional time series is reduced: the smaller $w$ is, the larger $n/w$ is, and the more information is compressed.
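The equiprobable breakpoints described above can be sketched with the Python standard library. This is a minimal illustration, not the authors' implementation; the function name `gaussian_breakpoints` is hypothetical, and the sketch assumes the series has already been z-normalized so that the standard normal distribution applies.

```python
from statistics import NormalDist


def gaussian_breakpoints(alpha):
    """Return the alpha - 1 cut points that split N(0, 1) into alpha equal-area regions."""
    if alpha < 2:
        raise ValueError("alpha must be at least 2")
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    # The k-th breakpoint is the k/alpha quantile of the standard normal distribution.
    return [standard_normal.inv_cdf(k / alpha) for k in range(1, alpha)]
```

For an alphabet of size 4, this yields three breakpoints that are symmetric around zero, so each of the four regions carries probability 1/4.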

2.1. The Distance Calculation by SAX

For example, a sequence of length n is converted into w symbols. The specific steps are as follows:
First, the time series is divided into $w$ segments of the same size according to the Piecewise Aggregate Approximation (PAA) algorithm. The segment averages form the vector $\bar{C} = \bar{C}_1, \bar{C}_2, \ldots, \bar{C}_w$, where the $i$th element of $\bar{C}$ is the average of the $i$th segment and is calculated by the following equation:
$$\bar{C}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} C_j \tag{1}$$
where $C_j$ is one time point of the time series $C$. Second, the breakpoints that divide the value space into $\alpha$ equiprobable regions are determined.
These breakpoints are arranged in a sorted list $B = \beta_1, \beta_2, \ldots, \beta_{\alpha-1}$ such that the area under the standard Gaussian curve between $\beta_i$ and $\beta_{i+1}$ is $1/\alpha$.
Finally, the $w$ segments are represented by means of these breakpoints. SAX maps the average value of each segment to an alphabetic symbol: an average below the smallest breakpoint $\beta_1$ is mapped to ‘a’, an average greater than or equal to $\beta_1$ and below the second breakpoint $\beta_2$ is mapped to ‘b’, and so on. The resulting symbol string roughly describes the time series.
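The three steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `paa` and `sax_word` are hypothetical, the input is assumed to be z-normalized already, and the series length is assumed to be divisible by w.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean


def paa(series, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal-sized segments."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes len(series) is divisible by w")
    size = n // w
    return [fmean(series[i * size:(i + 1) * size]) for i in range(w)]


def sax_word(series, w, alpha):
    """Map each segment mean to a letter using equiprobable Gaussian breakpoints."""
    cuts = [NormalDist().inv_cdf(k / alpha) for k in range(1, alpha)]
    # bisect_right counts the breakpoints that are <= mean, so a mean below the
    # first breakpoint gives index 0 ('a'), the next region gives 'b', and so on.
    return "".join(chr(ord("a") + bisect_right(cuts, mean)) for mean in paa(series, w))
```

With w = 3 and α = 3, a series whose segment means are clearly negative, near zero, and clearly positive is mapped to the word "abc".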
Given two time series Q and C of the same length n, each divided into w segments, let $\hat{Q}$ and $\hat{C}$ be the symbol strings obtained from the SAX transformation. The SAX distance between Q and C is then defined as follows:
$$\mathrm{MINDIST}(\hat{Q}, \hat{C}) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left( \mathrm{dist}(\hat{q}_i, \hat{c}_i) \right)^2} \tag{2}$$
Here, $\mathrm{dist}(\hat{q}, \hat{c})$ can be looked up in Table 1. The lookup can be expressed as the following equation:
$$\mathrm{dist}(\hat{q}, \hat{c}) = \begin{cases} 0, & \text{if } |\hat{q} - \hat{c}| \le 1 \\ \beta_{\max(\hat{q}, \hat{c}) - 1} - \beta_{\min(\hat{q}, \hat{c})}, & \text{otherwise} \end{cases} \tag{3}$$
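A minimal sketch of Equations (2) and (3) is given below. The names `symbol_dist` and `mindist` are hypothetical, SAX words are assumed to be lowercase strings starting at 'a', and the breakpoints are recomputed from the standard normal distribution instead of being read from Table 1.

```python
from math import sqrt
from statistics import NormalDist


def symbol_dist(a, b, cuts):
    """Distance between two SAX symbols; adjacent or equal symbols have distance 0."""
    i, j = ord(a) - ord("a"), ord(b) - ord("a")
    if abs(i - j) <= 1:
        return 0.0
    # cuts[k] holds beta_(k+1), so beta_(max-1) - beta_(min) for 1-based symbols
    # becomes cuts[max(i, j) - 1] - cuts[min(i, j)] for 0-based indices.
    return cuts[max(i, j) - 1] - cuts[min(i, j)]


def mindist(word_q, word_c, n, alpha):
    """SAX distance between two words of equal length built from series of length n."""
    if len(word_q) != len(word_c):
        raise ValueError("both words must have the same length")
    w = len(word_q)
    cuts = [NormalDist().inv_cdf(k / alpha) for k in range(1, alpha)]
    total = sum(symbol_dist(a, b, cuts) ** 2 for a, b in zip(word_q, word_c))
    return sqrt(n / w) * sqrt(total)
```

Note that `mindist("ab", "ba", 8, 4)` is 0 even though the words differ, which is exactly the loss of information that motivates the trend and boundary distances discussed later.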

2.2. An Improvement of SAX Distance Measure for Time Series

As the first method to symbolize time series in a way that can be applied effectively to time series classification, SAX has been recognized by many experts and scholars; however, its shortcoming is also obvious. The smaller w and α are, the more features of the time series are lost. To keep as much important information as possible, the trend of the time series needs to be preserved during SAX dimensionality reduction. For example, reference [26] discusses some limitations of using the SAX algorithm for time series classification. These cases are shown in Figure 2.
In Figure 2, the average values of a and d, of b and e, and of c and f are the same, but their trends are clearly different. To describe this difference correctly, the authors of [26] proposed the SAX-TD method. According to the calculation rules of SAX-TD, the trend distance td(q, c) of two time series segments q and c is calculated first. It is defined as follows:
$$td(q, c) = \sqrt{\left( \Delta q(t_s) - \Delta c(t_s) \right)^2 + \left( \Delta q(t_e) - \Delta c(t_e) \right)^2} \tag{4}$$
where $t_s$ and $t_e$ are the start point and end point, respectively, of a time segment of q and c. $\Delta q(t)$ is defined as follows:
$$\Delta q(t) = q(t) - \bar{q} \tag{5}$$
$\Delta c(t)$ is calculated in the same way. In [26], the authors refer to this quantity as the tendency of a time segment.
With the SAX-TD representation, the time series Q and C are represented as follows:
$$Q: \Delta q(1)\, \hat{q}_1\, \Delta q(2)\, \hat{q}_2 \cdots \Delta q(w)\, \hat{q}_w\, \Delta q(w+1)$$
$$C: \Delta c(1)\, \hat{c}_1\, \Delta c(2)\, \hat{c}_2 \cdots \Delta c(w)\, \hat{c}_w\, \Delta c(w+1)$$
where $\hat{q}_1, \hat{q}_2, \ldots, \hat{q}_w$ is the sequence symbolized by SAX, $\Delta q(1), \Delta q(2), \ldots, \Delta q(w)$ are the trend variations, and $\Delta q(w+1)$ is the change of the last point.
The distance between two time series can be defined based on the trend distance as follows:
$$\mathrm{TDIST}(\hat{Q}, \hat{C}) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left( \left( \mathrm{dist}(\hat{q}_i, \hat{c}_i) \right)^2 + \frac{w}{n} \left( td(q_i, c_i) \right)^2 \right)} \tag{6}$$
where $\hat{Q}$ and $\hat{C}$ denote the representations of time series Q and C, respectively, n is the length of Q and C, and w is the number of time segments. The distance between time series Q and C can be calculated by Equation (6). In [26], the authors proved that this distance lower-bounds the Euclidean distance, and their experimental results showed that it improves classification accuracy compared with ESAX.
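Equations (4) to (6) can be sketched as follows. This is an illustration only: the names `trend_distance` and `tdist` are hypothetical, and the sketch assumes that the start and end deviations are taken from the first and last point of each raw segment and that the per-segment symbol distances have already been computed with Equation (3).

```python
from math import sqrt
from statistics import fmean


def trend_distance(seg_q, seg_c):
    """Equation (4): compare start and end deviations from each segment's own mean."""
    mean_q, mean_c = fmean(seg_q), fmean(seg_c)
    d_start = (seg_q[0] - mean_q) - (seg_c[0] - mean_c)
    d_end = (seg_q[-1] - mean_q) - (seg_c[-1] - mean_c)
    return sqrt(d_start ** 2 + d_end ** 2)


def tdist(symbol_dists, trend_dists, n):
    """Equation (6): combine per-segment symbol distances and trend distances."""
    w = len(symbol_dists)
    total = sum(d ** 2 + (w / n) * t ** 2 for d, t in zip(symbol_dists, trend_dists))
    return sqrt(n / w) * sqrt(total)
```

A rising segment `[0, 1, 2]` and a falling segment `[2, 1, 0]` share the mean 1, so their SAX symbols coincide, but their trend distance is √8, which the combined distance retains.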

3. SAX-BD: Boundary Distance-Based Method For Time Series

3.1. An Analysis of SAX-TD

First, in Figure 3, we select the segments b, c, e, and f from Figure 2.
The SAX-TD algorithm can identify the difference between b and e and between c and f because the trend variation of b is $\Delta q(t)$ while that of e is $-\Delta q(t)$, so the final distances distinguish these segments. However, an attempt to distinguish b from c, or e from f, is very likely to fail. The trend variations of b and c, and of e and f, take the same value ($\Delta q(t)$ or $-\Delta q(t)$, respectively), so according to the calculation rules of SAX-TD they are judged to be the same segment.

3.2. Our Method SAX-BD

To solve these problems, we propose using the boundary distance as a new reference instead of the trend distance. The details are as follows:
Figure 4 shows that this method resembles ESAX in the feature points it uses, but it differs from ESAX in how it uses them.
The maximum and minimum values of each time segment form its boundary. For segments such as c and f, the boundary distance is measured from these two values to the segment average, as shown in Equations (7) and (8):
$$\Delta q(t)_{\max} = q(t)_{\max} - \bar{q} \tag{7}$$
$$\Delta q(t)_{\min} = q(t)_{\min} - \bar{q} \tag{8}$$
The tendency change of c calculated by the SAX-BD algorithm is $\Delta q(t_{\max})$, and the tendency change of f is $\Delta q(t_{\max})$. Our method can therefore also distinguish these segments well. For b and c, the distance calculated using SAX-TD is the same, but in our method, SAX-BD, the boundary distance between them is not equal to 0, which makes it possible to distinguish the two segments. For the cases of g and h, our method gives the following:
$$\Delta q(t_s) = \Delta q(t_{\max,h}) \quad \text{and} \quad \Delta q(t_e) = \Delta q(t_{\min,h}) \tag{9}$$
$$\Delta q(t_{\max,g}) - \Delta q(t_s) \neq 0 \quad \text{and} \quad \Delta q(t_{\min,g}) - \Delta q(t_e) \neq 0 \tag{10}$$
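The difference between trend deviations and boundary deviations can be illustrated with a small sketch. The function names and the two example segments are hypothetical and are not taken from the figures; the segments are chosen so that they share the same mean, start point, and end point.

```python
from statistics import fmean


def trend_deviations(segment):
    """Deviations of the first and last point from the segment mean (SAX-TD view)."""
    mean = fmean(segment)
    return segment[0] - mean, segment[-1] - mean


def boundary_deviations(segment):
    """Equations (7) and (8): deviations of the maximum and minimum from the mean."""
    mean = fmean(segment)
    return max(segment) - mean, min(segment) - mean


steady = [0, 1, 2, 3, 4]     # rises monotonically
swinging = [0, 6, 2, -2, 4]  # same mean, same start and end, larger swings
```

Both segments have the trend deviations (−2, 2), so a trend-based distance between them is 0, while their boundary deviations are (2, −2) and (4, −4).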

3.3. Difference from ESAX

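In the ESAX method, the maximum, minimum, and mean values of each time segment are mapped to symbols and arranged by their positions $P_{\max}$, $P_{mid}$, and $P_{\min}$ within the segment according to the following formula:
$$\langle S_1, S_2, S_3 \rangle = \begin{cases} \langle S_{\max}, S_{mid}, S_{\min} \rangle, & \text{if } P_{\max} < P_{mid} < P_{\min} \\ \langle S_{\max}, S_{\min}, S_{mid} \rangle, & \text{if } P_{\max} < P_{\min} < P_{mid} \\ \langle S_{mid}, S_{\min}, S_{\max} \rangle, & \text{if } P_{mid} < P_{\min} < P_{\max} \\ \langle S_{mid}, S_{\max}, S_{\min} \rangle, & \text{if } P_{mid} < P_{\max} < P_{\min} \\ \langle S_{\min}, S_{mid}, S_{\max} \rangle, & \text{if } P_{\min} < P_{mid} < P_{\max} \\ \langle S_{\min}, S_{\max}, S_{mid} \rangle, & \text{otherwise} \end{cases} \tag{11}$$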
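The position-based arrangement used by ESAX amounts to sorting the three feature values by where they occur in the segment. The sketch below is an assumption-laden illustration: the name `esax_order` is hypothetical, the mean is assumed to be placed at the midpoint of the segment, ties are resolved by taking the first occurrence of the maximum and minimum, and the values are returned before they are mapped to symbols.

```python
from statistics import fmean


def esax_order(segment):
    """Return the max, mean, and min of a segment ordered by their positions."""
    n = len(segment)
    p_max = max(range(n), key=segment.__getitem__)  # first position of the maximum
    p_min = min(range(n), key=segment.__getitem__)  # first position of the minimum
    p_mid = (n - 1) / 2  # assumption: the mean is located at the segment midpoint
    features = [
        (p_max, max(segment)),
        (p_mid, fmean(segment)),
        (p_min, min(segment)),
    ]
    # Sorting by position covers all six orderings of the case distinction at once.
    return [value for _, value in sorted(features, key=lambda item: item[0])]
```

For the rising segment `[0, 1, 2, 3, 4]`, the minimum comes first, then the mean, then the maximum.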
However, although we use the same feature points, we do not map them directly in the same way as the ESAX method, mainly for the following two reasons:
Firstly, in Figure 5, if the points are mapped as in the ESAX method, then for α = 6 the points A, B, C, D, and E are all mapped to the same character ‘f’, and F, G, and H are all mapped to ‘a’. We instead retain these feature points directly and calculate the boundary distance. The specific values of A, B, C, D, E and of F, G, H can then be discriminated better.
Secondly, Equation (11) shows that the ESAX method may have to distinguish a total of six orderings. Our method needs to distinguish only two. Since our distance measure has the same form as that of SAX-TD, the proof that Equation (13) lower-bounds the Euclidean distance follows the one given in the SAX-TD paper [26].
$$\langle \Delta S_1, \Delta S_2 \rangle = \begin{cases} \langle \Delta S_{\min}, \Delta S_{\max} \rangle, & \text{if } P_{\max} > P_{\min} \\ \langle \Delta S_{\max}, \Delta S_{\min} \rangle, & \text{if } P_{\min} > P_{\max} \end{cases}$$
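The two-case ordering of the boundary deviations can be sketched as follows. The name `ordered_boundary_deviations` is hypothetical, and the handling of ties (first occurrence of the maximum and minimum, and a constant segment falling into the second case) is an assumption that the text does not specify.

```python
from statistics import fmean


def ordered_boundary_deviations(segment):
    """Return the two boundary deviations in the order in which they occur."""
    mean = fmean(segment)
    p_max = max(range(len(segment)), key=segment.__getitem__)
    p_min = min(range(len(segment)), key=segment.__getitem__)
    d_max = segment[p_max] - mean
    d_min = segment[p_min] - mean
    # If the maximum occurs after the minimum, the minimum deviation comes first.
    return (d_min, d_max) if p_max > p_min else (d_max, d_min)
```

A rising segment therefore yields (minimum deviation, maximum deviation), and a falling segment yields the reverse order.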
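Finally, time series Q and C can be expressed as follows according to our method SAX-BD:
$$Q: \hat{q}_1 \Delta S_1^1 \Delta S_2^1 \; \hat{q}_2 \Delta S_1^2 \Delta S_2^2 \; \cdots \; \hat{q}_w \Delta S_1^w \Delta S_2^w$$
$$C: \hat{c}_1 \Delta C_1^1 \Delta C_2^1 \; \hat{c}_2 \Delta C_1^2 \Delta C_2^2 \; \cdots \; \hat{c}_w \Delta C_1^w \Delta C_2^w$$
The distance between Q and C is calculated with the following equations:
$$bd(q, c) = \sqrt{\left( \Delta q(t)_1 - \Delta c(t)_1 \right)^2 + \left( \Delta q(t)_2 - \Delta c(t)_2 \right)^2} \tag{12}$$
$$\mathrm{BDIST}(\hat{Q}, \hat{C}) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left( \left( \mathrm{dist}(\hat{q}_i, \hat{c}_i) \right)^2 + \frac{w}{n} \left( bd(q_i, c_i) \right)^2 \right)} \tag{13}$$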
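Putting the pieces together, Equations (12) and (13) can be sketched end to end. This is a hedged illustration, not the authors' code: all function names are hypothetical, the series length is assumed to be divisible by w, the inputs are assumed to be z-normalized in real use, and ties in the positions of the maximum and minimum are resolved by first occurrence.

```python
from bisect import bisect_right
from math import sqrt
from statistics import NormalDist, fmean


def _segments(series, w):
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes len(series) is divisible by w")
    size = n // w
    return [series[i * size:(i + 1) * size] for i in range(w)]


def _ordered_boundary(segment):
    """Boundary deviations (max and min minus mean), ordered by position."""
    mean = fmean(segment)
    p_max = max(range(len(segment)), key=segment.__getitem__)
    p_min = min(range(len(segment)), key=segment.__getitem__)
    d_max, d_min = segment[p_max] - mean, segment[p_min] - mean
    return (d_min, d_max) if p_max > p_min else (d_max, d_min)


def _symbol_dist(i, j, cuts):
    """Equation (3) on 0-based symbol indices."""
    return 0.0 if abs(i - j) <= 1 else cuts[max(i, j) - 1] - cuts[min(i, j)]


def bdist(q, c, w, alpha):
    """Equation (13): SAX symbol distance combined with the weighted boundary distance."""
    if len(q) != len(c):
        raise ValueError("both series must have the same length")
    n = len(q)
    cuts = [NormalDist().inv_cdf(k / alpha) for k in range(1, alpha)]
    total = 0.0
    for seg_q, seg_c in zip(_segments(q, w), _segments(c, w)):
        sym_q = bisect_right(cuts, fmean(seg_q))
        sym_c = bisect_right(cuts, fmean(seg_c))
        (q1, q2), (c1, c2) = _ordered_boundary(seg_q), _ordered_boundary(seg_c)
        bd_squared = (q1 - c1) ** 2 + (q2 - c2) ** 2  # Equation (12), squared
        total += _symbol_dist(sym_q, sym_c, cuts) ** 2 + (w / n) * bd_squared
    return sqrt(n / w) * sqrt(total)
```

For two segments with the same mean, the symbol term vanishes and the result reduces to the boundary distance of Equation (12).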

3.4. Lower Bound

One of the most important characteristics of SAX is that it provides a lower-bounding distance measure. A lower bound is very useful for controlling errors and speeding up the computation. Below, we show that our proposed distance also lower-bounds the Euclidean distance.
It has been proved in [19,20] that the PAA distance lower-bounds the Euclidean distance as follows:
$$\sqrt{\sum_{i=1}^{n} (q_i - c_i)^2} \ge \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} (\bar{q}_i - \bar{c}_i)^2} \tag{14}$$
To prove that BDIST also lower-bounds the Euclidean distance, we repeat some of the proofs here. Let $\bar{Q}$ and $\bar{C}$ be the means of time series Q and C, respectively. We first consider only the single-frame case (i.e., w = 1), in which Equation (14), after squaring both sides, can be rewritten as follows:
$$\sum_{i=1}^{n} (q_i - c_i)^2 \ge n (\bar{Q} - \bar{C})^2 \tag{15}$$
Recall that $\bar{Q}$ is the average of the time series, so each point $q_i$ can be written as $q_i = \bar{Q} + \Delta q_i$, in line with Equation (5). The same applies to each point $c_i$ in C, so Equation (15) can be rewritten as follows:
$$n (\bar{Q} - \bar{C})^2 + \sum_{i=1}^{n} (\Delta q_i - \Delta c_i)^2 \ge n (\bar{Q} - \bar{C})^2 \tag{16}$$
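The step from Equation (15) to Equation (16) can be spelled out as follows. This is a short worked expansion under the assumption that $\Delta q_i$ and $\Delta c_i$ denote the deviations of $q_i$ and $c_i$ from the means $\bar{Q}$ and $\bar{C}$; the same result is obtained if the deviations are defined with the opposite sign.

```latex
\begin{aligned}
\sum_{i=1}^{n} (q_i - c_i)^2
  &= \sum_{i=1}^{n} \bigl( (\bar{Q} - \bar{C}) + (\Delta q_i - \Delta c_i) \bigr)^2 \\
  &= n (\bar{Q} - \bar{C})^2
     + 2 (\bar{Q} - \bar{C}) \sum_{i=1}^{n} (\Delta q_i - \Delta c_i)
     + \sum_{i=1}^{n} (\Delta q_i - \Delta c_i)^2 \\
  &= n (\bar{Q} - \bar{C})^2 + \sum_{i=1}^{n} (\Delta q_i - \Delta c_i)^2
\end{aligned}
```

The cross term vanishes because deviations from a mean sum to zero, that is, $\sum_{i} \Delta q_i = \sum_{i} \Delta c_i = 0$.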
Because $\sum_{i=1}^{n} (\Delta q_i - \Delta c_i)^2 \ge 0$, Equation (16) holds. Recalling the definitions in Equation (9) and Equation (12), $(bd(q, c))^2 = (\Delta q(t)_1 - \Delta c(t)_1)^2 + (\Delta q(t)_2 - \Delta c(t)_2)^2$, and noting that the boundary distances are among the $\Delta q_i$, we obtain the following inequality:
$$\sum_{i=1}^{n} (\Delta q_i - \Delta c_i)^2 \ge \left( \Delta q(t)_1 - \Delta c(t)_1 \right)^2 + \left( \Delta q(t)_2 - \Delta c(t)_2 \right)^2 \tag{17}$$
Substituting Equation (16) into Equation (17), we get:
$$n (\bar{Q} - \bar{C})^2 + \sum_{i=1}^{n} (\Delta q_i - \Delta c_i)^2 \ge n (\bar{Q} - \bar{C})^2 + \left( bd(q_i, c_i) \right)^2 \tag{18}$$
According to [20], MINDIST lower-bounds the PAA distance, that is:
$$n (\bar{Q} - \bar{C})^2 \ge n \left( \mathrm{dist}(\hat{Q}, \hat{C}) \right)^2 \tag{19}$$
where $\hat{Q}$ and $\hat{C}$ are the symbolic representations of Q and C in the original SAX, respectively. By transitivity, the following inequality holds:
$$n (\bar{Q} - \bar{C})^2 + \sum_{i=1}^{n} (\Delta q_i - \Delta c_i)^2 \ge n \left( \mathrm{dist}(\hat{Q}, \hat{C}) \right)^2 + \left( bd(q_i, c_i) \right)^2 \tag{20}$$
Recalling Equation (15), this means
$$\sum_{i=1}^{n} (q_i - c_i)^2 \ge n \left( \left( \mathrm{dist}(\hat{Q}, \hat{C}) \right)^2 + \frac{1}{n} \left( bd(q_i, c_i) \right)^2 \right) \tag{21}$$
The case of w frames is obtained by applying the single-frame proof to every frame, that is
$$\sqrt{\sum_{i=1}^{n} (q_i - c_i)^2} \ge \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left( \left( \mathrm{dist}(\hat{q}_i, \hat{c}_i) \right)^2 + \frac{w}{n} \left( bd(q_i, c_i) \right)^2 \right)} \tag{22}$$
The quality of a lower-bounding distance is usually measured by the tightness of lower bounding (TLB):
$$\mathrm{TLB} = \frac{\text{Lower Bounding Distance}(Q, C)}{\text{Euclidean Distance}(Q, C)}$$
The value of TLB is in the range [0, 1]. The larger the TLB value, the better the quality. Recalling the distance measure in Equation (13), we obtain $\mathrm{TLB}(\mathrm{BDIST}) \ge \mathrm{TLB}(\mathrm{MINDIST})$, which means that the SAX-BD distance has a tighter lower bound than the original SAX distance.
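As an illustration of how TLB is computed, the sketch below evaluates it for the PAA distance of Equation (14), which serves here as the example lower bound. The function names are hypothetical, and the series length is assumed to be divisible by w.

```python
from math import dist, sqrt
from statistics import fmean


def paa(series, w):
    """Mean of each of w equal-sized segments."""
    size = len(series) // w
    return [fmean(series[i * size:(i + 1) * size]) for i in range(w)]


def paa_distance(q, c, w):
    """Right-hand side of Equation (14)."""
    n = len(q)
    squared = sum((a - b) ** 2 for a, b in zip(paa(q, w), paa(c, w)))
    return sqrt(n / w) * sqrt(squared)


def tightness_of_lower_bound(lower_bound, euclidean):
    """TLB: ratio of a lower-bounding distance to the true Euclidean distance."""
    if euclidean == 0:
        raise ValueError("TLB is undefined for identical series")
    return lower_bound / euclidean


q, c = [0, 1, 2, 3], [1, 1, 1, 1]
tlb = tightness_of_lower_bound(paa_distance(q, c, 2), dist(q, c))
```

Here the PAA distance is √5 and the Euclidean distance is √6, so the TLB is √(5/6); a value closer to 1 indicates a tighter bound.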

4. Experimental Validation

In this section, we present the results of our experimental validation. We used a stand-alone desktop computer with an Intel(R) Core(TM) i5-4440 CPU @ 3.10 GHz.
Firstly, we introduce the data sets used, the comparison methods and parameter settings. Then, in order to show experimental results more conveniently, we evaluate the performances of the proposed method in terms of classification accuracy rate shown in figures and classification error rate shown in tables.

4.1. Data Sets

We use the latest time series archive, UCRArchive2018. To make the experimental results more credible, we removed the data with null values, which left the 100 data sets shown in Table 2. Each data set is divided into a training set and a testing set and comes with detailed documentation. The data sets contain between 2 and 60 classes, and the lengths of their time series vary from 15 to 2844. In addition, the types of the data sets are diverse, including image, sensor, motion, ECG, etc. [27].

4.2. Comparison Methods and Parameter Settings

We compare our results with the above-mentioned ESAX and SAX. We also compare with SAX-TD, a recent improvement of SAX based on the trend distance. We evaluate the methods on the classification task, in which the accuracy is determined by the distance measure. This allows us to check directly whether our method improves on the SAX-TD method. To compare the classification accuracy, we conduct the experiments using the 1 nearest neighbor (1-NN) classifier, following Sun et al. [26].
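The evaluation protocol can be sketched as a 1-NN classifier with a pluggable distance function. This is a minimal illustration with hypothetical names; the Euclidean distance from the standard library is used as the default, and any of the distances discussed above could be passed in instead.

```python
from math import dist


def one_nn_predict(train, query, distance=dist):
    """Return the label of the training series that is closest to `query`.

    `train` is a list of (series, label) pairs.
    """
    best_label, best_distance = None, float("inf")
    for series, label in train:
        d = distance(query, series)
        if d < best_distance:
            best_label, best_distance = label, d
    return best_label


def error_rate(train, test, distance=dist):
    """Fraction of test series whose predicted label differs from the true label."""
    wrong = sum(1 for series, label in test
                if one_nn_predict(train, series, distance) != label)
    return wrong / len(test)
```

Because the classifier has no parameters of its own, differences in the error rate can be attributed to the distance measure.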
To make the comparison fair for each method, we use the testing data to search for the best parameters w and α. For a given time series of length n, w and α are picked using the following criteria [28]:
For w, we search for the value from 2 up to n/2 and double the value of w each time.
For α, we search for the value from 3 up to 10.
If two sets of parameter settings produce the same classification error rate, we choose the smaller parameters.
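The search criteria above can be sketched as a small grid search. The names `candidate_w` and `best_parameters` are hypothetical, `error_fn` stands for any function that returns the classification error rate for a given (w, α), and two details are assumptions: n/2 is rounded down for odd n, and ties are broken by preferring the smaller w first and then the smaller α.

```python
def candidate_w(n):
    """Values of w from 2 up to n/2, doubling each time."""
    values, w = [], 2
    while w <= n // 2:
        values.append(w)
        w *= 2
    return values


def best_parameters(n, error_fn):
    """Return (error, w, alpha) with the lowest error, preferring smaller parameters."""
    best = None
    for w in candidate_w(n):
        for alpha in range(3, 11):  # alpha from 3 up to 10
            candidate = (error_fn(w, alpha), w, alpha)
            # Tuple comparison: lower error first, then smaller w, then smaller alpha.
            if best is None or candidate < best:
                best = candidate
    return best
```

For a series of length 16, the candidate values of w are 2, 4, and 8, each combined with the eight values of α.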
The dimensionality reduction ratios are defined as follows:
$$\text{Dimensionality Reduction Ratio} = \frac{\text{Number of the reduced data points}}{\text{Number of the original data points}}$$

4.3. Result Analysis

To make the table fit all the data, we abbreviate SAX-TD as SAXTD and SAX-BD as SAXBD. The overall classification results are listed in Table 3, where the entries with the lowest classification error rates are highlighted. SAX-BD has the lowest error on most of the data sets (69/100), followed by SAX-TD (22/100) and the Euclidean distance (EU, 15/100). In some cases, SAX-BD performs much better than the other methods and even achieves a classification error rate of 0.
We use the Wilcoxon signed ranks test to test the significance of our method against the other methods. The test results are displayed in Table 4, where $n_+$, $n_-$, and $n_0$ denote the numbers of data sets on which the error rates of SAX-BD are lower than, larger than, and equal to those of another method, respectively. The p-values (the smaller the p-value, the more significant the improvement) demonstrate that our distance measure achieves a significant improvement over the other four methods in classification accuracy.
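The counts $n_+$, $n_-$, and $n_0$ can be obtained from two lists of per-data-set error rates as sketched below. The function name and the example numbers are hypothetical and are not taken from Table 3 or Table 4; the p-value itself would come from a statistics library and is not computed here.

```python
def win_loss_tie(errors_ours, errors_other):
    """Count data sets where our error is lower (n+), larger (n-), or equal (n0)."""
    if len(errors_ours) != len(errors_other):
        raise ValueError("both lists must cover the same data sets")
    pairs = list(zip(errors_ours, errors_other))
    n_plus = sum(1 for ours, other in pairs if ours < other)
    n_minus = sum(1 for ours, other in pairs if ours > other)
    n_zero = sum(1 for ours, other in pairs if ours == other)
    return n_plus, n_minus, n_zero
```

For example, `win_loss_tie([0.1, 0.2, 0.3], [0.2, 0.2, 0.1])` counts one win, one loss, and one tie.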
To provide a more intuitive illustration of the performance of the different measures compared in Table 3 and Table 4, we use scatter plots for pairwise comparisons. In a scatter plot, each dot represents a data set, and the accuracy rates of the two measures under comparison are used as the x and y coordinates of the dot. When a dot lies above the diagonal line, the ‘y’ method performs better than the ‘x’ method. In addition, the further a dot is from the diagonal line, the greater the margin of the accuracy improvement. The region with more dots indicates the better of the two methods.
In the following, we explain the results in Figure 6.
We illustrate the performance of our distance measure against the Euclidean distance, SAX distance, ESAX distance, SAX-TD distance in Figure 6a–d, respectively. Our method outperforms the other four methods by a large margin, both in the number of points and the distance of these points from the diagonals. From these figures, we can see that most of the points are far away from the diagonals, which indicates that our method has much lower error rates on most of the data sets.
To show how the performance of our method and the other three methods changes with the parameters, we run the experiments on the data set Yoga. We first compare the classification error rates for different w while α is fixed at 3, and then for different α while w is fixed at 4 (to illustrate the classification error rates with small parameters). Secondly, we vary w while α is fixed at 10, and then vary α while w is fixed at 128 (to illustrate the classification error rates with large parameters).
SAX-TD and SAX-BD have lower error rates than the other two methods for both small and large parameters, and SAX-BD has lower error rates than SAX-TD. The results are shown in Figure 7.
The dimensionality reduction ratios are calculated using the w at which the four methods achieve their smallest classification error rates on each data set, as shown in Figure 8. The SAX-TD and SAX-BD representations use more values than SAX, and SAX-TD uses fewer values than ESAX. In fact, our method has a low dimensionality reduction ratio on the majority of the data sets and even uses fewer values than SAX-TD.
We also recorded the running times of SAX-TD and SAX-BD for different α from 3 to 10, shown in Figure 9. The experimental results indicate that our method achieves a greater improvement at the cost of only a little more time, which we consider well worth it.

5. Conclusions

Our proposed SAX-BD algorithm uses the boundary distance as a new distance metric to obtain a new time series representation. We analyzed some cases that ESAX and SAX-TD cannot solve; it is also known that the classification accuracy of the ESAX algorithm is not as good as that of SAX-TD. We combined the advantages of these two methods and derived our method as an extension of them. We also proved that our improved distance measure keeps a lower bound to the Euclidean distance, and we showed that it has a low dimensionality reduction ratio on the majority of the data sets. In terms of the expression complexity of time series, our algorithm SAX-BD and the ESAX algorithm use three times as many values as SAX and two times as many as SAX-TD. In terms of running time, however, our method spends only a little more. In terms of classification accuracy, our method improves considerably, which means that a good compromise has been made between dimensionality reduction and classification accuracy. In future work, we intend to modify our algorithm to reduce its running time.

Author Contributions

Methodology, Z.H. and S.L.; Project administration, Z.H.; Software, H.Z.; Writing—review & editing, X.M. All authors have read and agreed to the published version of the manuscript.

Funding

The work described in this paper was supported by the National Natural Science Foundation of China (41972306, U1711267, 41572314) and the geo-disaster data processing and intelligent monitoring project.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Abanda, A.; Mori, U.; Lozano, J.A. A review on distance based time series classification. Data Min. Knowl. Discov. 2018, 33, 378–412.
  2. Keogh, E.J.; Kasetty, S. On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. Data Min. Knowl. Discov. 2003, 7, 349–371.
  3. Vlachos, M.; Kollios, G.; Gunopulos, D. Discovering similar multidimensional trajectories. In Proceedings of the 18th International Conference on Data Engineering, San Jose, CA, USA, 26 February–1 March 2002; p. 673.
  4. Lonardi, J.; Patel, P. Finding motifs in time series. In Proceedings of the 2nd Workshop on Temporal Data Mining, Washington, DC, USA, 24–27 August 2002.
  5. Keogh, E.; Lonardi, S.; Chiu, B.Y. Finding surprising patterns in a time series database in linear time and space. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 23–25 July 2002.
  6. Kalpakis, K.; Gada, D.; Puttagunta, V. Distance measures for effective clustering of ARIMA time-series. In Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, USA, 29 November–2 December 2001; pp. 273–280.
  7. Huang, Y.-W.; Yu, P.S. Adaptive query processing for time-series data. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’99), San Diego, CA, USA, 15–18 August 1999; pp. 282–286.
  8. Chan, K.-P.; Fu, A.W.-C. Efficient time series matching by wavelets. In Proceedings of the 15th International Conference on Data Engineering (Cat. No.99CB36337), Sydney, Australia, 23–26 March 1999; pp. 126–133.
  9. Dasgupta, D.; Forrest, S. Novelty detection in time series data using ideas from immunology. In Proceedings of the International Conference on Intelligent Systems, Ahmedabad, India, 15–16 November 1996.
  10. Xing, Z.; Pei, J.; Keogh, E. A brief survey on sequence classification. ACM SIGKDD Explor. Newsl. 2010, 12, 40–48.
  11. Faloutsos, C.; Ranganathan, M.; Manolopoulos, Y. Fast subsequence matching in time-series databases. ACM Sigmod Rec. 1994, 23, 419–429.
  12. Popivanov, I.; Miller, R. Similarity search over time-series data using wavelets. In Proceedings of the 18th International Conference on Data Engineering, Washington, DC, USA, 26 February–1 March 2002.
  13. Korn, F.; Jagadish, H.V.; Faloutsos, C. Efficiently supporting ad hoc queries in large datasets of time sequences. ACM Sigmod Rec. 1997, 26, 289–300.
  14. Bagnall, A.; Janacek, G. A Run Length Transformation for Discriminating Between Auto Regressive Time Series. J. Classif. 2013, 31, 154–178.
  15. Corduas, M.; Piccolo, D. Time series clustering and classification by the autoregressive metric. Comput. Stat. Data Anal. 2008, 52, 1860–1872.
  16. Smyth, P. Clustering sequences with hidden Markov models. In Proceedings of the Advances in Neural Information Processing Systems, Curitiba, Brazil, 2–5 November 1997.
  17. Berndt, D.J.; Clifford, J. Using dynamic time warping to find patterns in time series. In Proceedings of the KDD Workshop, Seattle, WA, USA, 31 July 1994.
  18. Keogh, E.J.; Chakrabarti, K.; Pazzani, M.J.; Mehrotra, S. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowl. Inf. Syst. 2001, 3, 263–286.
  19. Hellerstein, J.M.; Koutsoupias, E.; Papadimitriou, C.H. On the analysis of indexing schemes. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS ’97), Tucson, AZ, USA, 12–14 May 1997.
  20. Lin, J.; Keogh, E.; Lonardi, S.; Chiu, B. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, CA, USA, 13 June 2003.
  21. Tayebi, H.; Krishnaswamy, S.; Waluyo, A.B.; Sinha, A.; Abouelhoda, M. RA-SAX: Resource-Aware Symbolic Aggregate Approximation for Mobile ECG Analysis. In Proceedings of the 2011 IEEE 12th International Conference on Mobile Data Management, Lulea, Sweden, 6–9 June 2011; Volume 1, pp. 289–290.
  22. Canelas, A.; Neves, R.F.; Horta, N. A new SAX-GA methodology applied to investment strategies optimization. In Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation Conference Companion (GECCO Companion ’12), Philadelphia, PA, USA, 7–11 July 2012; pp. 1055–1062.
  23. Rakthanmanon, T.; Keogh, E. Fast shapelets: A scalable algorithm for discovering time series shapelets. In Proceedings of the 2013 SIAM International Conference on Data Mining, Austin, TX, USA, 2–4 May 2013.
  24. Zheng, Y.; Liu, Q.; Chen, E.; Ge, Y.; Zhao, J.L. Time Series Classification Using Multi-Channels Deep Convolutional Neural Networks. In Proceedings of the Lecture Notes in Computer Science, Leipzig, Germany, 22–26 June 2014; Springer Science and Business Media LLC: Macau, China, 2014; pp. 298–310.
  25. Lkhagva, B.; Suzuki, Y.; Kawagoe, K. New Time Series Data Representation ESAX for Financial Applications. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW’06), Atlanta, GA, USA, 3–7 April 2006; p. 115.
  26. Sun, Y.; Li, J.; Liu, J.; Sun, B.; Chow, C. An improvement of symbolic aggregate approximation distance measure for time series. Neurocomputing 2014, 138, 189–198.
  27. Dau, H.A.; Bagnall, A.; Kamgar, K.; Yeh, C.-C.M.; Zhu, Y.; Gharghabi, S.; Ratanamahatana, C.A.; Keogh, E. The UCR time series archive. IEEE/CAA J. Autom. Sin. 2019, 6, 1293–1305.
  28. Lin, J.; Keogh, E.; Wei, L.; Lonardi, S. Experiencing SAX: A novel symbolic representation of time series. Data Min. Knowl. Discov. 2007, 15, 107–144.
Figure 1. Financial time series A and B have the same SAX symbolic representation ‘decfdb’ under the same settings (time series length 30, 6 segments, alphabet size 6), although they are different time series.
Figure 2. Several typical segments with the same average value but different trends [26]. Segments a and d, b and e, and c and f run in opposite directions but all have the same mean value.
Figure 3. Several typical segments with the same average value and the same trend but different boundary distances. Segments b and c, and e and f, have the same SAX representation and trend distance although they are different segments.
Figure 4. Several typical segments with the same average value but different boundary distances. Segments a and d, b and e, and c and f run in opposite directions but all have the same mean value. The trend distance is replaced by the boundary distance.
Figure 5. Time series represented as ‘adfeeffcaefffdaabc’ by ESAX [25], where the length of the time series is 30, the number of segments is 6, and the alphabet size is 6. The capital letters A–H mark the maximum and minimum values in every segment.
Figure 6. Accuracy of the SAX-BD algorithm compared with other algorithms. Panels (a–d) compare SAX-BD with Euclidean distance, SAX, ESAX, and SAX-TD, respectively. The more dots above the red diagonal line, the better SAX-BD performs.
Figure 7. The classification error rates of SAX, ESAX, SAX-TD and SAX-BD with different parameters w and α. For (a), on Gun-Point, w varies while α is fixed at 3; for (b), on Gun-Point, α varies while w is fixed at 4. For (c), on Yoga, w varies while α is fixed at 10; for (d), on Yoga, α varies while w is fixed at 128.
Figure 8. Dimensionality reduction ratio of the four methods.
Figure 9. The running time of different methods with different values of α.
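Table 1. A lookup table for breakpoints with the alphabet size from 3 to 10.

| β_i | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|
| β_1 | −0.43 | −0.67 | −0.84 | −0.97 | −1.07 | −1.15 | −1.22 | −1.28 |
| β_2 | 0.43 | 0 | −0.25 | −0.43 | −0.57 | −0.67 | −0.76 | −0.84 |
| β_3 | | 0.67 | 0.25 | 0 | −0.18 | −0.32 | −0.43 | −0.52 |
| β_4 | | | 0.84 | 0.43 | 0.18 | 0 | −0.14 | −0.25 |
| β_5 | | | | 0.97 | 0.57 | 0.32 | 0.14 | 0 |
| β_6 | | | | | 1.07 | 0.67 | 0.43 | 0.25 |
| β_7 | | | | | | 1.15 | 0.76 | 0.52 |
| β_8 | | | | | | | 1.22 | 0.84 |
| β_9 | | | | | | | | 1.28 |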
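The breakpoints in Table 1 are the values that cut the standard normal distribution into equiprobable regions, so they can be recomputed rather than looked up. The following is a minimal sketch under that assumption; the helper names `sax_breakpoints` and `to_symbol` are hypothetical and do not come from the paper.

```python
from statistics import NormalDist


def sax_breakpoints(alphabet_size):
    """Return the alphabet_size - 1 cut points that split N(0, 1)
    into alphabet_size regions of equal probability."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return [std_normal.inv_cdf(i / alphabet_size)
            for i in range(1, alphabet_size)]


def to_symbol(value, breakpoints):
    """Map a z-normalized segment mean to a letter.

    Values below the first breakpoint map to 'a', values between the
    first and second breakpoint map to 'b', and so on.
    """
    index = sum(1 for b in breakpoints if value >= b)
    return chr(ord("a") + index)


if __name__ == "__main__":
    # Print one column of the lookup table per alphabet size.
    for size in range(3, 11):
        print(size, [round(b, 2) for b in sax_breakpoints(size)])
```

Rounding the computed values to two decimals should give the entries of the lookup table; `to_symbol` then assigns each segment mean to its region, which is the step that turns segment averages into a SAX word.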
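Table 2. 100 different types of time series datasets.

| ID | Type | Name | Train | Test | Class | Length |
|---|---|---|---|---|---|---|
| 1 | Device | ACSF1 | 100 | 100 | 10 | 1460 |
| 2 | Image | Adiac | 390 | 391 | 37 | 176 |
| 3 | Image | ArrowHead | 36 | 175 | 3 | 251 |
| 4 | Spectro | Beef | 30 | 30 | 5 | 470 |
| 5 | Image | BeetleFly | 20 | 20 | 2 | 512 |
| 6 | Image | BirdChicken | 20 | 20 | 2 | 512 |
| 7 | Simulated | BME | 30 | 150 | 3 | 128 |
| 8 | Sensor | Car | 60 | 60 | 4 | 577 |
| 9 | Simulated | CBF | 30 | 900 | 3 | 128 |
| 10 | Traffic | Chinatown | 20 | 343 | 2 | 24 |
| 11 | Sensor | CinCECGTorso | 40 | 1380 | 4 | 1639 |
| 12 | Spectro | Coffee | 28 | 28 | 2 | 286 |
| 13 | Device | Computers | 250 | 250 | 2 | 720 |
| 14 | Motion | CricketX | 390 | 390 | 12 | 300 |
| 15 | Motion | CricketY | 390 | 390 | 12 | 300 |
| 16 | Motion | CricketZ | 390 | 390 | 12 | 300 |
| 17 | Image | DiatomSizeReduction | 16 | 306 | 4 | 345 |
| 18 | Image | DistalPhalanxOutlineAgeGroup | 400 | 139 | 3 | 80 |
| 19 | Image | DistalPhalanxOutlineCorrect | 600 | 276 | 2 | 80 |
| 20 | Image | DistalPhalanxTW | 400 | 139 | 6 | 80 |
| 21 | Sensor | Earthquakes | 322 | 139 | 2 | 512 |
| 22 | ECG | ECG200 | 100 | 100 | 2 | 96 |
| 23 | ECG | ECGFiveDays | 23 | 861 | 2 | 136 |
| 24 | EOG | EOGHorizontalSignal | 362 | 362 | 12 | 1250 |
| 25 | EOG | EOGVerticalSignal | 362 | 362 | 12 | 1250 |
| 26 | Spectro | EthanolLevel | 504 | 500 | 4 | 1751 |
| 27 | Image | FaceAll | 560 | 1690 | 14 | 131 |
| 28 | Image | FaceFour | 24 | 88 | 4 | 350 |
| 29 | Image | FacesUCR | 200 | 2050 | 14 | 131 |
| 30 | Image | FiftyWords | 450 | 455 | 50 | 270 |
| 31 | Image | Fish | 175 | 175 | 7 | 463 |
| 32 | Sensor | FordA | 3601 | 1320 | 2 | 500 |
| 33 | Sensor | FordB | 3636 | 810 | 2 | 500 |
| 34 | HRM | Fungi | 18 | 186 | 18 | 201 |
| 35 | Motion | GunPoint | 50 | 150 | 2 | 150 |
| 36 | Motion | GunPointAgeSpan | 135 | 316 | 2 | 150 |
| 37 | Motion | GunPointMaleVersusFemale | 135 | 316 | 2 | 150 |
| 38 | Motion | GunPointOldVersusYoung | 136 | 315 | 2 | 150 |
| 39 | Spectro | Ham | 109 | 105 | 2 | 431 |
| 40 | Image | HandOutlines | 1000 | 370 | 2 | 2709 |
| 41 | Motion | Haptics | 155 | 308 | 5 | 1092 |
| 42 | Image | Herring | 64 | 64 | 2 | 512 |
| 43 | Device | HouseTwenty | 40 | 119 | 2 | 2000 |
| 44 | Motion | InlineSkate | 100 | 550 | 7 | 1882 |
| 45 | EPG | InsectEPGRegularTrain | 62 | 249 | 3 | 601 |
| 46 | EPG | InsectEPGSmallTrain | 17 | 249 | 3 | 601 |
| 47 | Sensor | InsectWingbeatSound | 220 | 1980 | 11 | 256 |
| 48 | Sensor | ItalyPowerDemand | 67 | 1029 | 2 | 24 |
| 49 | Device | LargeKitchenAppliances | 375 | 375 | 3 | 720 |
| 50 | Sensor | Lightning2 | 60 | 61 | 2 | 637 |
| 51 | Sensor | Lightning7 | 70 | 73 | 7 | 319 |
| 52 | Spectro | Meat | 60 | 60 | 3 | 448 |
| 53 | Image | MedicalImages | 381 | 760 | 10 | 99 |
| 54 | Traffic | MelbournePedestrian | 1194 | 2439 | 10 | 24 |
| 55 | Image | MiddlePhalanxOutlineAgeGroup | 400 | 154 | 3 | 80 |
| 56 | Image | MiddlePhalanxOutlineCorrect | 600 | 291 | 2 | 80 |
| 57 | Image | MiddlePhalanxTW | 399 | 154 | 6 | 80 |
| 58 | Sensor | MoteStrain | 20 | 1252 | 2 | 84 |
| 59 | ECG | NonInvasiveFetalECGThorax1 | 1800 | 1965 | 42 | 750 |
| 60 | ECG | NonInvasiveFetalECGThorax2 | 1800 | 1965 | 42 | 750 |
| 61 | Spectro | OliveOil | 30 | 30 | 4 | 570 |
| 62 | Image | OSULeaf | 200 | 242 | 6 | 427 |
| 63 | Image | PhalangesOutlinesCorrect | 1800 | 858 | 2 | 80 |
| 64 | Sensor | Phoneme | 214 | 1896 | 39 | 1024 |
| 65 | Hemodynamics | PigAirwayPressure | 104 | 208 | 52 | 2000 |
| 66 | Hemodynamics | PigArtPressure | 104 | 208 | 52 | 2000 |
| 67 | Hemodynamics | PigCVP | 104 | 208 | 52 | 2000 |
| 68 | Sensor | Plane | 105 | 105 | 7 | 144 |
| 69 | Power | PowerCons | 180 | 180 | 2 | 144 |
| 70 | Image | ProximalPhalanxOutlineAgeGroup | 400 | 205 | 3 | 80 |
| 71 | Image | ProximalPhalanxOutlineCorrect | 600 | 291 | 2 | 80 |
| 72 | Image | ProximalPhalanxTW | 400 | 205 | 6 | 80 |
| 73 | Device | RefrigerationDevices | 375 | 375 | 3 | 720 |
| 74 | Spectrum | Rock | 20 | 50 | 4 | 2844 |
| 75 | Device | ScreenType | 375 | 375 | 3 | 720 |
| 76 | Spectrum | SemgHandGenderCh2 | 300 | 600 | 2 | 1500 |
| 77 | Spectrum | SemgHandMovementCh2 | 450 | 450 | 6 | 1500 |
| 78 | Spectrum | SemgHandSubjectCh2 | 450 | 450 | 5 | 1500 |
| 79 | Simulated | ShapeletSim | 20 | 180 | 2 | 500 |
| 80 | Image | ShapesAll | 600 | 600 | 60 | 512 |
| 81 | Device | SmallKitchenAppliances | 375 | 375 | 3 | 720 |
| 82 | Simulated | SmoothSubspace | 150 | 150 | 3 | 15 |
| 83 | Sensor | SonyAIBORobotSurface1 | 20 | 601 | 2 | 70 |
| 84 | Sensor | SonyAIBORobotSurface2 | 27 | 953 | 2 | 65 |
| 85 | Spectro | Strawberry | 613 | 370 | 2 | 235 |
| 86 | Image | SwedishLeaf | 500 | 625 | 15 | 128 |
| 87 | Image | Symbols | 25 | 995 | 6 | 398 |
| 88 | Simulated | SyntheticControl | 300 | 300 | 6 | 60 |
| 89 | Motion | ToeSegmentation1 | 40 | 228 | 2 | 277 |
| 90 | Motion | ToeSegmentation2 | 36 | 130 | 2 | 343 |
| 91 | Sensor | Trace | 100 | 100 | 4 | 275 |
| 92 | ECG | TwoLeadECG | 23 | 1139 | 2 | 82 |
| 93 | Simulated | TwoPatterns | 1000 | 4000 | 4 | 128 |
| 94 | Simulated | UMD | 36 | 144 | 3 | 150 |
| 95 | Sensor | Wafer | 1000 | 6164 | 2 | 152 |
| 96 | Spectro | Wine | 57 | 54 | 2 | 234 |
| 97 | Image | WordSynonyms | 267 | 638 | 25 | 270 |
| 98 | Motion | Worms | 181 | 77 | 5 | 900 |
| 99 | Motion | WormsTwoClass | 181 | 77 | 2 | 900 |
| 100 | Image | Yoga | 300 | 3000 | 2 | 426 |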
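Table 3. 1-NN classification error rates of different methods.

| ID | EU Error | SAX Error | SAX w | SAX Ratio | SAX α | ESAX Error | ESAX w | ESAX Ratio | ESAX α | SAXTD Error | SAXTD w | SAXTD Ratio | SAXTD α | SAXBD Error | SAXBD w | SAXBD Ratio | SAXBD α |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.460 | 0.580 | 256 | 0.175 | 8 | 0.760 | 256 | 0.526 | 3 | 0.380 | 4 | 0.005 | 3 | 0.400 | 2 | 0.004 | 3 |
| 2 | 0.389 | 0.895 | 64 | 0.364 | 9 | 0.890 | 32 | 0.545 | 7 | 0.284 | 32 | 0.364 | 3 | 0.263 | 32 | 0.545 | 4 |
| 3 | 0.200 | 0.309 | 32 | 0.127 | 10 | 0.349 | 64 | 0.765 | 10 | 0.183 | 32 | 0.255 | 3 | 0.160 | 16 | 0.191 | 5 |
| 4 | 0.333 | 0.467 | 128 | 0.272 | 7 | 0.400 | 128 | 0.817 | 7 | 0.167 | 128 | 0.545 | 4 | 0.200 | 128 | 0.817 | 6 |
| 5 | 0.250 | 0.150 | 64 | 0.125 | 4 | 0.100 | 16 | 0.094 | 4 | 0.150 | 16 | 0.063 | 5 | 0.100 | 2 | 0.012 | 3 |
| 6 | 0.450 | 0.300 | 256 | 0.500 | 4 | 0.200 | 128 | 0.750 | 5 | 0.200 | 4 | 0.016 | 4 | 0.200 | 2 | 0.012 | 3 |
| 7 | 0.173 | 0.153 | 16 | 0.125 | 7 | 0.160 | 8 | 0.188 | 7 | 0.147 | 16 | 0.250 | 3 | 0.060 | 4 | 0.094 | 4 |
| 8 | 0.267 | 0.283 | 256 | 0.444 | 10 | 0.283 | 128 | 0.666 | 6 | 0.133 | 32 | 0.111 | 4 | 0.117 | 16 | 0.083 | 3 |
| 9 | 0.148 | 0.084 | 16 | 0.125 | 8 | 0.250 | 4 | 0.094 | 9 | 0.088 | 8 | 0.125 | 5 | 0.027 | 4 | 0.094 | 4 |
| 10 | 0.058 | 0.467 | 16 | 0.667 | 7 | 0.125 | 8 | 1.000 | 7 | 0.041 | 8 | 0.667 | 3 | 0.041 | 4 | 0.500 | 3 |
| 11 | 0.103 | 0.097 | 128 | 0.078 | 9 | 0.108 | 64 | 0.117 | 10 | 0.072 | 128 | 0.156 | 9 | 0.062 | 64 | 0.117 | 8 |
| 12 | 0.000 | 0.429 | 256 | 0.895 | 4 | 0.321 | 4 | 0.042 | 6 | 0.000 | 16 | 0.112 | 3 | 0.000 | 16 | 0.168 | 3 |
| 13 | 0.424 | 0.480 | 16 | 0.022 | 6 | 0.432 | 16 | 0.067 | 4 | 0.404 | 256 | 0.711 | 3 | 0.380 | 128 | 0.533 | 3 |
| 14 | 0.423 | 0.385 | 128 | 0.427 | 9 | 0.444 | 64 | 0.640 | 10 | 0.400 | 32 | 0.213 | 6 | 0.331 | 16 | 0.160 | 5 |
| 15 | 0.433 | 0.441 | 64 | 0.213 | 8 | 0.523 | 64 | 0.640 | 8 | 0.441 | 16 | 0.107 | 7 | 0.372 | 16 | 0.160 | 6 |
| 16 | 0.413 | 0.387 | 64 | 0.213 | 10 | 0.426 | 64 | 0.640 | 10 | 0.387 | 32 | 0.213 | 6 | 0.323 | 16 | 0.160 | 7 |
| 17 | 0.065 | 0.062 | 4 | 0.012 | 6 | 0.232 | 2 | 0.017 | 4 | 0.039 | 8 | 0.046 | 4 | 0.029 | 2 | 0.017 | 3 |
| 18 | 0.374 | 0.317 | 32 | 0.400 | 4 | 0.381 | 8 | 0.300 | 4 | 0.331 | 16 | 0.400 | 4 | 0.273 | 4 | 0.150 | 3 |
| 19 | 0.283 | 0.348 | 64 | 0.800 | 6 | 0.308 | 2 | 0.075 | 8 | 0.264 | 32 | 0.800 | 4 | 0.246 | 16 | 0.600 | 4 |
| 20 | 0.367 | 0.432 | 16 | 0.200 | 6 | 0.439 | 16 | 0.600 | 9 | 0.360 | 32 | 0.800 | 4 | 0.367 | 32 | 1.200 | 5 |
| 21 | 0.288 | 0.245 | 256 | 0.500 | 6 | 0.259 | 64 | 0.375 | 5 | 0.252 | 16 | 0.063 | 3 | 0.295 | 8 | 0.047 | 3 |
| 22 | 0.120 | 0.080 | 32 | 0.333 | 6 | 0.140 | 32 | 1.000 | 5 | 0.070 | 32 | 0.667 | 4 | 0.090 | 64 | 2.000 | 5 |
| 23 | 0.203 | 0.114 | 64 | 0.471 | 8 | 0.211 | 16 | 0.353 | 8 | 0.081 | 16 | 0.235 | 4 | 0.117 | 2 | 0.044 | 3 |
| 24 | 0.558 | 0.616 | 32 | 0.026 | 9 | 0.619 | 16 | 0.038 | 8 | 0.638 | 16 | 0.026 | 4 | 0.599 | 4 | 0.010 | 6 |
| 25 | 0.638 | 0.599 | 256 | 0.205 | 9 | 0.575 | 8 | 0.019 | 8 | 0.530 | 16 | 0.026 | 4 | 0.602 | 8 | 0.019 | 6 |
| 26 | 0.726 | 0.732 | 256 | 0.146 | 5 | 0.748 | 2 | 0.003 | 3 | 0.694 | 32 | 0.037 | 4 | 0.702 | 512 | 0.877 | 4 |
| 27 | 0.286 | 0.320 | 32 | 0.244 | 9 | 0.250 | 32 | 0.733 | 8 | 0.227 | 16 | 0.244 | 5 | 0.206 | 32 | 0.733 | 3 |
| 28 | 0.216 | 0.159 | 32 | 0.091 | 8 | 0.205 | 64 | 0.549 | 9 | 0.136 | 32 | 0.183 | 5 | 0.125 | 16 | 0.137 | 3 |
| 29 | 0.231 | 0.252 | 32 | 0.244 | 10 | 0.334 | 32 | 0.733 | 10 | 0.251 | 16 | 0.244 | 9 | 0.173 | 16 | 0.366 | 5 |
| 30 | 0.369 | 0.327 | 64 | 0.237 | 9 | 0.319 | 32 | 0.356 | 8 | 0.334 | 256 | 1.896 | 7 | 0.325 | 128 | 1.422 | 5 |
| 31 | 0.217 | 0.451 | 256 | 0.553 | 8 | 0.623 | 128 | 0.829 | 8 | 0.143 | 64 | 0.276 | 4 | 0.166 | 32 | 0.207 | 5 |
| 32 | 0.335 | 0.327 | 256 | 0.512 | 7 | 0.336 | 128 | 0.768 | 8 | 0.304 | 64 | 0.256 | 3 | 0.315 | 64 | 0.384 | 3 |
| 33 | 0.394 | 0.428 | 128 | 0.256 | 6 | 0.436 | 128 | 0.768 | 6 | 0.399 | 128 | 0.512 | 5 | 0.394 | 128 | 0.768 | 5 |
| 34 | 0.161 | 0.118 | 32 | 0.159 | 6 | 0.210 | 16 | 0.239 | 7 | 0.172 | 16 | 0.159 | 3 | 0.140 | 16 | 0.239 | 3 |
| 35 | 0.087 | 0.207 | 128 | 0.853 | 5 | 0.013 | 8 | 0.160 | 6 | 0.073 | 4 | 0.053 | 5 | 0.040 | 4 | 0.080 | 5 |
| 36 | 0.032 | 0.111 | 64 | 0.427 | 8 | 0.051 | 8 | 0.160 | 7 | 0.076 | 64 | 0.853 | 3 | 0.063 | 4 | 0.080 | 4 |
| 37 | 0.006 | 0.044 | 32 | 0.213 | 9 | 0.025 | 32 | 0.640 | 9 | 0.003 | 128 | 1.707 | 3 | 0.009 | 8 | 0.160 | 3 |
| 38 | 0.000 | 0.108 | 64 | 0.427 | 9 | 0.063 | 32 | 0.640 | 9 | 0.000 | 4 | 0.053 | 3 | 0.000 | 2 | 0.040 | 3 |
| 39 | 0.400 | 0.324 | 128 | 0.297 | 7 | 0.343 | 128 | 0.891 | 6 | 0.305 | 16 | 0.074 | 4 | 0.324 | 32 | 0.223 | 4 |
| 40 | 0.138 | 0.162 | 32 | 0.012 | 7 | 0.176 | 128 | 0.142 | 8 | 0.130 | 8 | 0.006 | 4 | 0.119 | 8 | 0.009 | 3 |
| 41 | 0.630 | 0.620 | 1024 | 0.938 | 6 | 0.597 | 128 | 0.352 | 7 | 0.584 | 1024 | 1.875 | 7 | 0.568 | 32 | 0.088 | 3 |
| 42 | 0.484 | 0.375 | 8 | 0.016 | 5 | 0.375 | 128 | 0.750 | 5 | 0.375 | 32 | 0.125 | 3 | 0.375 | 16 | 0.094 | 4 |
| 43 | 0.319 | 0.235 | 512 | 0.256 | 7 | 0.210 | 512 | 0.768 | 7 | 0.303 | 2 | 0.002 | 3 | 0.202 | 64 | 0.096 | 3 |
| 44 | 0.658 | 0.678 | 128 | 0.068 | 10 | 0.671 | 128 | 0.204 | 9 | 0.664 | 4 | 0.004 | 4 | 0.653 | 4 | 0.006 | 7 |
| 45 | 0.000 | 0.329 | 128 | 0.213 | 5 | 0.333 | 128 | 0.639 | 6 | 0.317 | 4 | 0.013 | 5 | 0.225 | 4 | 0.020 | 4 |
| 46 | 0.000 | 0.317 | 8 | 0.013 | 8 | 0.382 | 32 | 0.160 | 5 | 0.325 | 32 | 0.106 | 4 | 0.317 | 32 | 0.160 | 4 |
| 47 | 0.438 | 0.432 | 32 | 0.125 | 8 | 0.458 | 64 | 0.750 | 7 | 0.420 | 128 | 1.000 | 5 | 0.416 | 128 | 1.500 | 4 |
| 48 | 0.045 | 0.077 | 16 | 0.667 | 9 | 0.109 | 8 | 1.000 | 8 | 0.044 | 16 | 1.333 | 3 | 0.047 | 16 | 2.000 | 4 |
| 49 | 0.507 | 0.528 | 512 | 0.711 | 8 | 0.541 | 16 | 0.067 | 8 | 0.456 | 16 | 0.044 | 4 | 0.419 | 16 | 0.067 | 4 |
| 50 | 0.246 | 0.148 | 64 | 0.100 | 7 | 0.197 | 128 | 0.603 | 5 | 0.197 | 16 | 0.050 | 6 | 0.148 | 8 | 0.038 | 4 |
| 51 | 0.425 | 0.370 | 256 | 0.803 | 6 | 0.329 | 8 | 0.075 | 6 | 0.356 | 8 | 0.050 | 6 | 0.274 | 16 | 0.150 | 4 |
| 52 | 0.067 | 0.667 | 2 | 0.004 | 3 | 0.667 | 2 | 0.013 | 3 | 0.067 | 16 | 0.071 | 3 | 0.067 | 2 | 0.013 | 3 |
| 53 | 0.316 | 0.322 | 64 | 0.646 | 7 | 0.309 | 32 | 0.970 | 9 | 0.325 | 32 | 0.646 | 5 | 0.325 | 64 | 1.939 | 6 |
| 54 | 0.055 | 0.592 | 16 | 0.667 | 10 | 0.665 | 8 | 1.000 | 9 | 0.089 | 16 | 1.333 | 3 | 0.087 | 16 | 2.000 | 3 |
| 55 | 0.481 | 0.429 | 2 | 0.025 | 3 | 0.429 | 4 | 0.150 | 3 | 0.435 | 2 | 0.050 | 3 | 0.468 | 32 | 1.200 | 3 |
| 56 | 0.234 | 0.368 | 64 | 0.800 | 8 | 0.419 | 16 | 0.600 | 4 | 0.237 | 64 | 1.600 | 5 | 0.265 | 16 | 0.600 | 5 |
| 57 | 0.487 | 0.597 | 64 | 0.800 | 6 | 0.565 | 8 | 0.300 | 7 | 0.494 | 8 | 0.200 | 3 | 0.481 | 8 | 0.300 | 3 |
| 58 | 0.121 | 0.149 | 16 | 0.190 | 5 | 0.215 | 16 | 0.571 | 6 | 0.118 | 32 | 0.762 | 5 | 0.125 | 32 | 1.143 | 6 |
| 59 | 0.171 | 0.448 | 512 | 0.683 | 10 | 0.792 | 128 | 0.512 | 10 | 0.183 | 32 | 0.085 | 4 | 0.181 | 16 | 0.064 | 5 |
| 60 | 0.120 | 0.408 | 512 | 0.683 | 10 | 0.673 | 128 | 0.512 | 10 | 0.115 | 128 | 0.341 | 5 | 0.117 | 16 | 0.064 | 8 |
| 61 | 0.133 | 0.833 | 2 | 0.004 | 3 | 0.833 | 2 | 0.011 | 3 | 0.100 | 128 | 0.449 | 3 | 0.100 | 64 | 0.337 | 3 |
| 62 | 0.479 | 0.455 | 32 | 0.075 | 6 | 0.438 | 64 | 0.450 | 8 | 0.455 | 256 | 1.199 | 5 | 0.442 | 16 | 0.112 | 3 |
| 63 | 0.239 | 0.357 | 32 | 0.400 | 5 | 0.383 | 4 | 0.150 | 3 | 0.220 | 64 | 1.600 | 4 | 0.227 | 32 | 1.200 | 4 |
| 64 | 0.891 | 0.908 | 64 | 0.063 | 8 | 0.905 | 4 | 0.012 | 6 | 0.905 | 128 | 0.250 | 8 | 0.878 | 4 | 0.012 | 3 |
| 65 | 0.909 | 0.933 | 128 | 0.064 | 8 | 0.933 | 64 | 0.096 | 6 | 0.928 | 8 | 0.008 | 3 | 0.817 | 2 | 0.003 | 3 |
| 66 | 0.712 | 0.861 | 64 | 0.032 | 5 | 0.875 | 512 | 0.768 | 3 | 0.841 | 32 | 0.032 | 4 | 0.649 | 2 | 0.003 | 3 |
| 67 | 0.861 | 0.904 | 1024 | 0.512 | 5 | 0.923 | 64 | 0.096 | 4 | 0.904 | 64 | 0.064 | 3 | 0.861 | 2 | 0.003 | 3 |
| 68 | 0.038 | 0.048 | 128 | 0.889 | 9 | 0.105 | 16 | 0.333 | 8 | 0.029 | 32 | 0.444 | 3 | 0.000 | 8 | 0.167 | 3 |
| 69 | 0.022 | 0.072 | 128 | 0.889 | 6 | 0.072 | 32 | 0.667 | 6 | 0.044 | 32 | 0.444 | 5 | 0.033 | 32 | 0.667 | 6 |
| 70 | 0.215 | 0.537 | 32 | 0.400 | 6 | 0.424 | 2 | 0.075 | 6 | 0.180 | 64 | 1.600 | 3 | 0.176 | 16 | 0.600 | 3 |
| 71 | 0.192 | 0.292 | 8 | 0.100 | 6 | 0.289 | 16 | 0.600 | 4 | 0.144 | 64 | 1.600 | 3 | 0.131 | 32 | 1.200 | 3 |
| 72 | 0.293 | 0.976 | 64 | 0.800 | 4 | 0.746 | 2 | 0.075 | 7 | 0.278 | 64 | 1.600 | 3 | 0.244 | 8 | 0.300 | 3 |
| 73 | 0.605 | 0.608 | 16 | 0.022 | 5 | 0.632 | 32 | 0.133 | 5 | 0.581 | 2 | 0.006 | 3 | 0.520 | 8 | 0.033 | 3 |
| 74 | 0.360 | 0.180 | 1024 | 0.360 | 4 | 0.220 | 256 | 0.270 | 4 | 0.160 | 32 | 0.023 | 3 | 0.140 | 1024 | 1.080 | 4 |
| 75 | 0.640 | 0.597 | 16 | 0.022 | 6 | 0.573 | 32 | 0.133 | 8 | 0.576 | 16 | 0.044 | 3 | 0.555 | 2 | 0.008 | 3 |
| 76 | 0.102 | 0.193 | 32 | 0.021 | 8 | 0.310 | 4 | 0.008 | 7 | 0.278 | 4 | 0.005 | 5 | 0.053 | 32 | 0.064 | 5 |
| 77 | 0.402 | 0.471 | 64 | 0.043 | 9 | 0.669 | 4 | 0.008 | 10 | 0.511 | 4 | 0.005 | 7 | 0.211 | 32 | 0.064 | 7 |
| 78 | 0.209 | 0.287 | 64 | 0.043 | 9 | 0.529 | 4 | 0.008 | 9 | 0.476 | 4 | 0.005 | 6 | 0.116 | 32 | 0.064 | 5 |
| 79 | 0.461 | 0.428 | 8 | 0.016 | 4 | 0.411 | 64 | 0.384 | 5 | 0.406 | 128 | 0.512 | 6 | 0.361 | 128 | 0.768 | 4 |
| 80 | 0.248 | 0.278 | 512 | 1.000 | 10 | 0.290 | 64 | 0.375 | 9 | 0.247 | 16 | 0.063 | 4 | 0.232 | 32 | 0.188 | 3 |
| 81 | 0.659 | 0.533 | 64 | 0.089 | 7 | 0.547 | 16 | 0.067 | 5 | 0.347 | 4 | 0.011 | 6 | 0.365 | 4 | 0.017 | 6 |
| 82 | 0.047 | 0.240 | 8 | 0.533 | 8 | 0.273 | 4 | 0.800 | 7 | 0.167 | 2 | 0.267 | 3 | 0.060 | 4 | 0.800 | 3 |
| 83 | 0.304 | 0.306 | 64 | 0.914 | 6 | 0.146 | 8 | 0.343 | 4 | 0.303 | 64 | 1.829 | 4 | 0.243 | 8 | 0.343 | 3 |
| 84 | 0.141 | 0.120 | 64 | 0.985 | 6 | 0.188 | 16 | 0.738 | 5 | 0.143 | 32 | 0.985 | 5 | 0.136 | 16 | 0.738 | 10 |
| 85 | 0.054 | 0.354 | 128 | 0.545 | 4 | 0.354 | 64 | 0.817 | 4 | 0.038 | 64 | 0.545 | 3 | 0.043 | 32 | 0.409 | 3 |
| 86 | 0.211 | 0.408 | 128 | 1.000 | 10 | 0.440 | 32 | 0.750 | 10 | 0.208 | 32 | 0.500 | 4 | 0.125 | 16 | 0.375 | 5 |
| 87 | 0.101 | 0.137 | 128 | 0.322 | 9 | 0.192 | 128 | 0.965 | 8 | 0.104 | 16 | 0.080 | 5 | 0.095 | 8 | 0.060 | 7 |
| 88 | 0.120 | 0.047 | 16 | 0.267 | 8 | 0.147 | 16 | 0.800 | 8 | 0.100 | 8 | 0.267 | 8 | 0.050 | 16 | 0.800 | 8 |
| 89 | 0.320 | 0.311 | 64 | 0.231 | 6 | 0.373 | 8 | 0.087 | 5 | 0.307 | 8 | 0.058 | 4 | 0.246 | 4 | 0.043 | 3 |
| 90 | 0.192 | 0.123 | 128 | 0.373 | 7 | 0.177 | 64 | 0.560 | 4 | 0.138 | 16 | 0.093 | 5 | 0.123 | 64 | 0.560 | 5 |
| 91 | 0.240 | 0.380 | 32 | 0.116 | 6 | 0.240 | 4 | 0.044 | 7 | 0.160 | 64 | 0.465 | 3 | 0.000 | 2 | 0.022 | 3 |
| 92 | 0.253 | 0.311 | 8 | 0.098 | 7 | 0.254 | 8 | 0.293 | 7 | 0.166 | 64 | 1.561 | 4 | 0.057 | 4 | 0.146 | 3 |
| 93 | 0.093 | 0.039 | 16 | 0.125 | 9 | 0.217 | 4 | 0.094 | 10 | 0.063 | 16 | 0.250 | 8 | 0.048 | 8 | 0.188 | 6 |
| 94 | 0.194 | 0.194 | 16 | 0.107 | 9 | 0.160 | 8 | 0.160 | 6 | 0.208 | 16 | 0.213 | 4 | 0.014 | 4 | 0.080 | 3 |
| 95 | 0.005 | 0.002 | 128 | 0.842 | 6 | 0.002 | 32 | 0.632 | 7 | 0.003 | 32 | 0.421 | 5 | 0.003 | 32 | 0.632 | 7 |
| 96 | 0.389 | 0.500 | 2 | 0.009 | 3 | 0.500 | 2 | 0.026 | 3 | 0.407 | 64 | 0.547 | 3 | 0.426 | 16 | 0.205 | 3 |
| 97 | 0.382 | 0.381 | 64 | 0.237 | 8 | 0.384 | 64 | 0.711 | 10 | 0.382 | 16 | 0.119 | 7 | 0.381 | 16 | 0.178 | 4 |
| 98 | 0.545 | 0.481 | 256 | 0.284 | 4 | 0.558 | 128 | 0.427 | 4 | 0.468 | 32 | 0.071 | 4 | 0.442 | 32 | 0.107 | 4 |
| 99 | 0.390 | 0.351 | 512 | 0.569 | 4 | 0.338 | 128 | 0.427 | 5 | 0.312 | 512 | 1.138 | 4 | 0.312 | 8 | 0.027 | 4 |
| 100 | 0.170 | 0.198 | 64 | 0.150 | 10 | 0.174 | 64 | 0.451 | 10 | 0.176 | 64 | 0.300 | 6 | 0.166 | 16 | 0.113 | 5 |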
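The Ratio columns of Table 3 appear to be the number of stored values per series divided by the original series length from Table 2. The sketch below checks that reading against the first two rows. The per-segment counts (1 for SAX, 3 for ESAX and SAX-BD, 2 for SAX-TD) are an assumption inferred from the tabulated numbers, not a definition quoted from the text, and `reduction_ratio` is a hypothetical helper name.

```python
def reduction_ratio(w, n, values_per_segment=1):
    """Stored values per series (values_per_segment * w) over the length n."""
    return values_per_segment * w / n


# Assumed number of stored values per segment for each method.
VALUES_PER_SEGMENT = {"SAX": 1, "ESAX": 3, "SAX-TD": 2, "SAX-BD": 3}

# (method, w) pairs for dataset 1 (ACSF1, length 1460), copied from Table 3.
acsf1_length = 1460
acsf1_settings = [("SAX", 256), ("ESAX", 256), ("SAX-TD", 4), ("SAX-BD", 2)]

if __name__ == "__main__":
    for method, w in acsf1_settings:
        ratio = reduction_ratio(w, acsf1_length, VALUES_PER_SEGMENT[method])
        print(method, w, round(ratio, 3))
```

Under this reading a ratio above 1, as in some SAX-TD and SAX-BD rows, simply means the chosen w stores more values than the raw series contains.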
Table 4. The Wilcoxon signed ranks test results of SAX-BD vs. other methods. A p-value less than or equal to 0.05 indicates a significant improvement. n* means positive, n means equal, and n0 means negative. The larger the value of n*, the better SAX-BD performs.

| Methods | n* | n | n0 | p-Value |
|---|---|---|---|---|
| SAX-BD vs. Euclidean | 79 | 15 | 6 | p < 0.05 |
| SAX-BD vs. SAX | 83 | 11 | 6 | p < 0.05 |
| SAX-BD vs. ESAX | 87 | 10 | 3 | p < 0.05 |
| SAX-BD vs. SAX-TD | 69 | 22 | 10 | p < 0.05 |
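The positive, equal, and negative counts in Table 4 can be tallied directly from paired error rates. The following is a minimal sketch, assuming that "positive" means SAX-BD has the strictly lower error on a dataset; the function name `count_outcomes` is hypothetical, and the sample values are the first six rows of Table 3.

```python
def count_outcomes(candidate_errors, baseline_errors):
    """Count datasets where the candidate's error is lower than, equal to,
    or higher than the baseline's error."""
    wins = ties = losses = 0
    for cand, base in zip(candidate_errors, baseline_errors):
        if cand < base:
            wins += 1
        elif cand == base:
            ties += 1
        else:
            losses += 1
    return wins, ties, losses


# Error rates for datasets 1 to 6, copied from Table 3.
sax_bd = [0.400, 0.263, 0.160, 0.200, 0.100, 0.200]
sax_td = [0.380, 0.284, 0.183, 0.167, 0.150, 0.200]

if __name__ == "__main__":
    print(count_outcomes(sax_bd, sax_td))
```

The significance test itself is not reproduced here; if SciPy is available, `scipy.stats.wilcoxon` applied to the two paired error lists is one way to obtain a p-value.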
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.


He, Z.; Long, S.; Ma, X.; Zhao, H. A Boundary Distance-Based Symbolic Aggregate Approximation Method for Time Series Data. Algorithms 2020, 13, 284. https://doi.org/10.3390/a13110284
