A new similarity measure based on shape information for invariant with multiple distortions
Introduction
Research on similarity measures is one of the core aspects of time series data mining [1], [2], [3]. Almost every time series data mining task requires a subtle notion of similarity between series [1], [2], [3], [5]. Because time series are noisy and volatile, two similar series usually differ by various distortions, which are generally regarded as combinations of the following five basic distortions: noise, amplitude scaling, amplitude shifting, temporal scaling, and linear drift [3], [4], [6], [7], [8].
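The five basic distortions can be simulated with a few lines of NumPy. The sketch below is only an illustration under stated assumptions: the function names, the Gaussian noise model, the linear-interpolation resampling, and all parameter values are hypothetical choices made here, not definitions taken from the paper.

```python
import numpy as np

def add_noise(q, sigma, rng):
    """Noise: add zero-mean Gaussian noise (assumed noise model)."""
    return q + rng.normal(0.0, sigma, size=q.shape)

def amplitude_scale(q, factor):
    """Amplitude scaling: multiply every value by a constant."""
    return factor * q

def amplitude_shift(q, offset):
    """Amplitude shifting: add a constant offset."""
    return q + offset

def temporal_scale(q, factor):
    """Temporal scaling: stretch or compress the time axis by resampling
    with linear interpolation (an assumed resampling scheme)."""
    n_new = max(2, int(round(len(q) * factor)))
    old_t = np.linspace(0.0, 1.0, len(q))
    new_t = np.linspace(0.0, 1.0, n_new)
    return np.interp(new_t, old_t, q)

def linear_drift(q, slope):
    """Linear drift: add a straight-line trend over the sample index."""
    return q + slope * np.arange(len(q))

rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 128)
Q = np.sin(t)

# A distorted copy Qd that combines all five basic distortions.
Qd = add_noise(Q, 0.05, rng)
Qd = amplitude_scale(Qd, 1.5)
Qd = amplitude_shift(Qd, 0.3)
Qd = temporal_scale(Qd, 1.25)
Qd = linear_drift(Qd, 0.002)
print(len(Q), len(Qd))
```

Varying one parameter at a time (for example `sigma` or `slope`) changes the degree of a single distortion while the others stay fixed, which is the kind of control a tolerance study needs.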
In recent years, hundreds of techniques have been designed to make similarity measures for time series invariant to these basic distortions [3], [4]. As a result, the similarity model has been extended in many different directions [3]: taking time warping into account [7], [8], [9], [11], [16], [17], [18], [19], [20], [21], [22], allowing amplitude shifting [7], [8], [22], allowing time series of different lengths [7], [8], [9], [11], [16], [17], [18], [19], [20], [21], [22], tolerating some degree of noise [7], [8], [10], [11], [15], [16], [17], [18], [21], and achieving invariance to complexity [6].
Many studies use the number of kinds of tolerable distortions to gauge the performance of a similarity measure [3], [6], [8]: the more kinds of distortions a similarity model tolerates, the more powerful it is. For example, according to the results of [7], [8], ASEAL has been shown to outperform other measures in the literature on ECG data sets when dealing with four basic distortions: noise, offset translation, amplitude scaling, and time axis scaling. However, few studies discuss the tolerable degree of a specific distortion when evaluating a similarity measure; in this paper, it is treated as an important index of performance. Besides, most existing approaches are costly to tune and to compute [10], [11], [15], [16], [18], [20], [21], [22]. For instance, the EDR and LCSS measures [15], [16], [18] require a threshold parameter, which is difficult to set without a priori knowledge of the data.
Inspired by shape recognition, this paper introduces a novel similarity measure, SIMshape, to address multiple distortions. Both the number of kinds of distortions tolerated and the degree of distortion tolerated are used to measure the robustness of SIMshape. To provide a comprehensive validation, experiments have been conducted on synthetic data and on five real time series data sets from different domains. The major contributions of this paper are the following:
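To illustrate why a matching threshold is hard to choose, the following minimal sketch implements a textbook longest-common-subsequence (LCSS) similarity for real-valued series. It is a simplified variant written for this illustration (no temporal window constraint), and the name `lcss_similarity`, the parameter `eps`, and the test signals are assumptions rather than the formulation used in [15], [16], [18].

```python
import numpy as np

def lcss_similarity(a, b, eps):
    """LCSS similarity in [0, 1]; points match when |a_i - b_j| <= eps."""
    n, m = len(a), len(b)
    table = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) <= eps:
                # Matching pair: extend the common subsequence.
                table[i, j] = table[i - 1, j - 1] + 1
            else:
                # No match: keep the best result seen so far.
                table[i, j] = max(table[i - 1, j], table[i, j - 1])
    return table[n, m] / min(n, m)

t = np.linspace(0.0, 2.0 * np.pi, 50)
a = np.sin(t)
b = np.sin(t) + 0.2  # same shape, shifted in amplitude

# The score depends on how eps compares with the (unknown) distortion size.
for eps in (0.05, 0.25, 0.5):
    print(eps, lcss_similarity(a, b, eps))
```

A threshold smaller than the amplitude offset penalizes the shifted copy, while a larger one treats the two series as matching everywhere, so a sensible `eps` presupposes knowledge of the distortion magnitude.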
- To record the most salient features of the original time series at different scales, a new symbolic approximate representation for time series is introduced. The representation is obtained through the Multi-scale Discrete Haar Wavelet Transform, key point extraction, and symbolization. Unlike traditional approximation methods, such as the Discrete Fourier Transform (DFT) [23], Symbolic Aggregate Approximation (SAX) [24], [25], and Piecewise Aggregate Approximation (PAA) [27], it does not require any parameter to be preset. The symbolization technique significantly reduces dimensionality, while the Multi-scale Discrete Haar Wavelet Transform and key point extraction retain the essential characteristics of the original time series. The resulting multi-scale shape information therefore underpins the efficiency and effectiveness of SIMshape.
- To improve the robustness of SIMshape to various transformations, a scale-weight function is designed. It makes SIMshape emphasize the basic shape information of the original sequence, which is preserved at the coarse level. Because small distortions of the kinds mentioned do not alter the essential characteristics of a time series, they have relatively little impact on the information at the coarse level. As a result, the robustness of SIMshape is improved by assigning larger weights to the coarse level.
- To measure the similarity between two time series, a novel similarity measure, SIMshape, based on the multi-scale shape information and the scale-weight function, is presented. SIMshape is parameter-free and easy to implement. A set of objective tests on synthetic data sets and five real time series data sets from different application domains shows that SIMshape is more robust to various deformations than LB_keogh [10], [11], ED, CID [6], and ASEAL [7], [8], and more accurate than these four methods when applied to classifying real time series.
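The idea of comparing shape at several scales and trusting the coarse scales most can be sketched as follows. This is not the SIMshape definition, which is not given in this excerpt. It is a hypothetical stand-in that uses pairwise averaging as the Haar approximation, replaces key point extraction and symbolization with a simple slope-sign code (-1, 0, +1), and assumes a weight that doubles at each coarser level. It also assumes both series have the same length of at least 8 samples. All function names are illustrative.

```python
import numpy as np

def haar_levels(x):
    """Haar approximations (pairwise means), ordered from fine to coarse."""
    x = np.asarray(x, dtype=float)
    if len(x) < 8:
        raise ValueError("need at least 8 samples")
    n = 1 << (len(x).bit_length() - 1)  # largest power of two <= len(x)
    level = x[:n]
    levels = []
    while len(level) > 4:
        level = level.reshape(-1, 2).mean(axis=1)
        levels.append(level)
    return levels

def slope_symbols(level):
    """Code each step as -1 (down), 0 (flat), or +1 (up)."""
    return np.sign(np.diff(level))

def weighted_shape_similarity(q, c):
    """Weighted fraction of matching slope symbols across scales."""
    levels_q, levels_c = haar_levels(q), haar_levels(c)
    if len(levels_q) != len(levels_c):
        raise ValueError("series must have comparable lengths")
    # Assumed scale weights: each coarser level counts twice as much.
    weights = 2.0 ** np.arange(len(levels_q))
    weights /= weights.sum()
    scores = [np.mean(slope_symbols(a) == slope_symbols(b))
              for a, b in zip(levels_q, levels_c)]
    return float(np.dot(weights, scores))

rng = np.random.default_rng(1)
t = np.linspace(0.0, 7.0, 128)
Q = np.sin(t) + 0.3 * np.cos(2.7 * t)
Qd = 1.5 * Q + 0.3 + rng.normal(0.0, 0.05, size=Q.shape)
print(weighted_shape_similarity(Q, Qd))
```

Because slope signs ignore positive amplitude scaling and constant offsets, and averaging suppresses noise at coarse levels, the distorted copy in this sketch is expected to score close to the undistorted one.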
The rest of this paper is organized as follows. Section 2 reviews the known basic distortions for time series and discusses related methods. Section 3 puts forward a new similarity measure and its specific implementation. Section 4 reports experiments on synthetic and real-world time series that evaluate the proposed method. Section 5 gives conclusions and some potential future work.
Section snippets
Problem statement
Suppose that there are two time series: Q and a slightly distorted version of it, Qd, where the degree of distortion is too small to alter the nature of Q. Specifically, Q and Qd share the same basic shape information apart from subtle differences, and they appear very similar to the human eye.
In this paper, the following five basic transformations [3], [4], [6], [7], [8] are considered. As shown in Fig. 1, these basic transformations are defined as follows:
- Noise: The noise distortion means that two time
The proposed method
The core contribution of this work is introduced in this section. To record the shape information at different scales, a symbolic approximate representation for time series, based on the multi-scale discrete wavelet transform and key points, is first proposed; then a new similarity measure is defined, which is based on this representation and a scale-weight factor. The overview of the proposed method is shown in Fig. 2. To make the presentation of the proposed work clear, the description of various
Performance analysis
This section presents the experimental results on the performance of SIMshape. To evaluate the proposed similarity measure, two experiments are conducted on synthetic data sets and on five real time series data sets from different domains. The synthetic data sets simulate the basic distortions and their degrees of deformation, and the real time series reflect combinations of the five basic distortions. In the first experiment, on the tolerance of the five basic distortions, SIMshape is compared
Conclusions
The goal of this paper is to propose a similarity measure that is more robust to distortion. That is, if a sequence Q is modified into a sequence Qd by introducing distortions (such as noise, amplitude scaling, amplitude shifting, time scaling, linear drift, and so on), the proposed similarity measure SIMshape should judge Q and Qd to be reasonably similar. With this in mind, a novel similarity measure SIMshape is introduced to address
Acknowledgments
The authors would like to thank the anonymous referees for their valuable criticism and suggestions. Their comments greatly contributed to improving the quality of this work. This work is supported by the Natural Science Foundation of China (NSFC) under Grant nos. 61174144, 61232018, and 60874065.
References (55)
- et al., A novel clustering method on time series data, Expert Syst. Appl. (2011)
- Matching of quasi-periodic time series patterns by exchange of block-sorting signatures, Pattern Recognition Lett. (2008)
- Reduced data similarity-based matching for time series patterns alignment, Pattern Recognition Lett. (2010)
- et al., Bounded similarity querying for time-series data, Inf. Comput. (2004)
- et al., Time series decomposition and measurement of business cycles, trends and growth cycles, J. Monetary Econ. (2006)
- et al., Intelligent stock trading system by turning point confirming and probabilistic reasoning, Expert Syst. Appl. (2008)
- A review on time series data mining, Eng. Appl. Artif. Intell. (2011)
- et al., Time-series data mining, ACM Comput. Surv. (2012)
- et al., Scaling and time warping in time series querying, VLDB J. (2008)
- et al., Querying and mining of time series data: experimental comparison of representations and distance measures, Proc. VLDB Endow. (2008)
- A Complexity-Invariant Distance Measure for Time Series
- Exact indexing of dynamic time warping, Knowl. Inf. Syst.
- Toward accurate dynamic time warping in linear time and space, Intell. Data Anal.
- Indexing multidimensional time-series, VLDB J.
- Similarity search on time series based on threshold queries, Lect. Notes Comput. Sci.
Xiaoxu He received the Bachelor's degree in Computer Science from Shandong Normal University in 2009. She is expected to receive the Ph.D. degree from the Department of Computer Science and Technology, University of Science and Technology of China, in June 2014. Since September 2009, she has been a researcher in the Intelligent Qualitative and Virtual Reality Lab of the University of Science and Technology of China. Her research interests include time series data mining (with emphasis on representation and similarity measures), nonlinear complex time series analysis, and uncertainty knowledge discovery.
Chenxi Shao received the Master's degree in Computer Science from the University of Science and Technology of China (USTC) in 1995. He is now the Director of the Intelligent Qualitative and Virtual Reality Lab and an associate professor in the School of Computer Science of USTC. His research interests lie mainly in qualitative simulation, non-stationary signal processing, and non-linear complex system theory, and in their applications to biomedical and communication systems. He has published more than eighty research papers in those areas.
Yan Xiong received the B.S., M.S., and Ph.D. degrees in Computer Science from the University of Science and Technology of China (USTC) in 1983, 1986, and 1990, respectively. He was a post-doctoral fellow in the Department of Computer Science and Communication, University of Missouri-Kansas City (UMKC), from 1995 to 1997. He is now the Director of the computer network and information security laboratory and a professor in the School of Computer Science of USTC. His research interests mainly include computer networks, information security, mobile computing, spatiotemporal information retrieval, data mining, mobile networks, and distributed processing.