Elsevier

Pattern Recognition Letters

Volume 32, Issue 13, 1 October 2011, Pages 1659-1668
Improving the classification accuracy of streaming data using SAX similarity features

https://doi.org/10.1016/j.patrec.2011.06.025

Abstract

The classification accuracy of time series depends heavily on the quality of the features used. This study presents a new type of feature, called SAX (Symbolic Aggregate approXimation) similarity features. SAX similarity features combine traditional classification based on statistical numbers with template-based classification. They are obtained from the data of a time window by first transforming the time series into a discrete representation using SAX. The similarity between this SAX representation and predefined SAX templates is then calculated, and these similarity values are used as SAX similarity features. The features were tested using five different activity data sets collected with wearable inertial sensors and five different classifiers. The results show that the recognition rates obtained using SAX similarity features together with traditional features are much better than those obtained with traditional features only. In 20 of the 23 tested cases, the improvement is statistically significant according to the paired t-test.

Highlights

► The study presents a new type of feature, called SAX (Symbolic Aggregate approXimation) similarity features.
► SAX similarity features combine traditional classification based on statistical numbers with template-based classification.
► The features are tested using five different data sets and classifiers.
► In 20 of the 23 tested cases, the presented method improved classification accuracy significantly according to the paired t-test.

Introduction

In this paper, we present a novel method for calculating features for real-time classification of streaming activity data using a sliding-window technique (Babcock et al., 2002). In this technique, the data are divided into fixed-length time windows, from which features are extracted. Classification is then done on the basis of these features using a machine learning algorithm. Normally, the extracted features are basic features, such as time or frequency domain features, statistical numbers, or correlations. In this study, the features represent the similarity between the SAX representation of the time window, which is a symbolic representation of the time series, and predefined SAX templates. Previously, classification was done using either basic features or template-based recognition; the method presented here combines these techniques in a novel way.
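
The sliding-window step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `sliding_windows` and `basic_features`, the window length, the step, and the choice of mean, standard deviation, minimum and maximum as basic features are all assumptions made for the example.

```python
from statistics import mean, pstdev


def sliding_windows(signal, window_len, step):
    """Yield consecutive fixed-length windows; a trailing partial window is dropped."""
    for start in range(0, len(signal) - window_len + 1, step):
        yield signal[start:start + window_len]


def basic_features(window):
    """Summarise a window with a few illustrative statistical numbers (assumed set)."""
    return [mean(window), pstdev(window), min(window), max(window)]


# Hypothetical one-dimensional sensor signal, used only to show the mechanics.
signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

# One feature row per window; each row would be passed to a classifier.
feature_rows = [basic_features(w) for w in sliding_windows(signal, window_len=4, step=2)]
```

With a step smaller than the window length, consecutive windows overlap, so each sample contributes to more than one feature row.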

The contributions of our paper are as follows:

  • We propose features of a new type, called SAX similarity features;

  • We use five different data sets and classifiers to show that a combination of SAX similarity features and basic features improves the classification accuracy; and

  • We show that this improvement is statistically significant according to the paired t-test.

SAX similarity features are calculated in three steps. First, the time window studied is divided into n equal-sized sub-windows, and a statistical number is extracted from each of the n sub-windows. In this way, the time window is compressed to length n. Second, the values obtained are transformed into a discrete representation to bring the data into SAX form (Lin et al., 2007). Finally, the SAX similarity feature is extracted by calculating the similarity between the obtained SAX series and a predefined SAX template. This SAX template must also consist of n symbols, so that the similarity can be calculated using a string matching measure. In theory, this method makes it possible to calculate an unlimited number of features by using different values of n and by mapping the time series to discrete representations with different vocabularies.
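
The three steps above can be sketched as follows, under several assumptions that the text does not fix. The statistical number per sub-window is taken to be the mean, the window is z-normalised before compression, symbols are assigned with the equiprobable Gaussian breakpoints of standard SAX, and the string matching measure is the fraction of positions at which the two words agree. All function names are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def z_normalise(window):
    """Scale the window to zero mean and unit variance (constant windows map to zeros)."""
    mu, sigma = mean(window), pstdev(window)
    if sigma == 0:
        return [0.0 for _ in window]
    return [(x - mu) / sigma for x in window]


def compress(window, n):
    """Step 1: one statistical number (here the mean) per each of n equal-sized sub-windows.

    Samples left over when len(window) is not divisible by n are ignored.
    """
    size = len(window) // n
    return [mean(window[i * size:(i + 1) * size]) for i in range(n)]


def to_symbols(values, alphabet):
    """Step 2: map each value to a symbol using equiprobable standard-normal breakpoints."""
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in values)


def sax_similarity_feature(window, template, alphabet="abcd"):
    """Step 3: similarity between the SAX word of the window and a template of length n."""
    n = len(template)
    word = to_symbols(compress(z_normalise(window), n), alphabet)
    matches = sum(a == b for a, b in zip(word, template))
    return word, matches / n  # assumed measure: share of matching positions


# Hypothetical window and template, used only to show the mechanics.
word, similarity = sax_similarity_feature([1, 1, 2, 2, 3, 3, 4, 4], template="abdd")
```

Here the steadily rising window maps to the word "abcd", which agrees with the template "abdd" in three of four positions. Varying the template, n, or the alphabet yields further features, as the paragraph above notes.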

The accuracy and efficiency of SAX similarity features are demonstrated with five different activity data sets and five different classifiers. The data sets were collected using wearable inertial sensors attached to the subject's wrist or to both wrists, and they included from four to eight different human actions and activities. Using these data sets, the functionality of SAX similarity features is tested, and the statistical significance of the results is shown using the paired t-test.
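
As an illustration of the paired t-test mentioned above, the sketch below computes the test statistic from two lists of paired accuracies. The accuracy values are invented placeholders and not results from the study, the pairing over repeated evaluation runs is an assumption, and `paired_t_statistic` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(baseline, candidate):
    """Return (t, degrees of freedom) for paired samples of equal length."""
    if len(baseline) != len(candidate) or len(baseline) < 2:
        raise ValueError("need two paired samples of equal length, at least 2 pairs")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    sd = stdev(diffs)  # sample standard deviation of the pairwise differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs))), len(diffs) - 1


# Placeholder accuracies in percent (hypothetical, for illustration only).
basic_only = [80.0, 82.0, 78.0, 85.0, 81.0]
basic_plus_saxs = [84.0, 85.0, 83.0, 88.0, 86.0]

t_value, dof = paired_t_statistic(basic_only, basic_plus_saxs)
```

The resulting t value is compared with the critical value of the t distribution for the given degrees of freedom, for example 2.776 for 4 degrees of freedom at the two-sided 5% level.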

The article contains the following sections: Section 2 presents related work. Section 3 introduces SAX similarity features and methods for calculating them. Section 4 evaluates the performance and accuracy of the proposed method with five different data sets and classifiers. Finally, the discussion and conclusions can be found in Section 5.

Section snippets

Related work

Symbolic time series analysis (Lind and Marcus, 1995) extracts relevant information from a signal to generate symbol sequences (Rajagopalan and Ray, 2006). It can be used, for example, to model motion; it can therefore also be applied to detect anomalies (Chin et al., 2005). In this article, it is employed to model human motion measured using inertial sensors. The method we used for the transformation is called SAX (Symbolic Aggregate approXimation) (Lin et

Methods

This article presents a novel method for calculating features used for real-time classification of streaming data. The presented method combines template-based recognition with traditional feature-based recognition, in which the information in a time series is summarized as one single value. Here, traditional features are called basic features. Such features are usually simple and fast to calculate; therefore, they are useful in real-time applications. In this study, the classification

Study

The SAXS features were tested using five different data sets. These data sets were classified using the different classifiers presented in Section 4.3; the results are shown in Section 4.4.

Discussion and conclusions

In this study, a new type of feature, called SAX similarity (SAXS) features, was presented. These features were tested using five different data sets and five different classifiers. The classification accuracy was calculated using three different feature sets: the SAXS feature set, the basic feature set (including statistical numbers, frequency domain features and correlations), and a combination of SAXS and basic features. The results are promising (see Fig. 6). More specifically, they show that, by

Acknowledgments

This study was carried out with financial support from the Sixth Framework Programme of the European Community for research, technological development and demonstration activities in the XPRESS (FleXible Production Experts for reconfigurable aSSembly technology) project. It does not necessarily reflect the Commission’s views and in no way anticipates the Commission’s future policy in this area.

Pekka Siirtola would like to thank GETA (The Graduate School in Electronics, Telecommunications and

References (37)

  • S.C. Chin et al. Symbolic time series analysis for anomaly detection: A comparative evaluation. Signal Process. (2005)
  • V. Rajagopalan et al. Symbolic time series analysis via wavelet-based partitioning. Signal Process. (2006)
  • B. Babcock et al. Models and issues in data stream systems
  • L. Bao et al. Activity recognition from user-annotated acceleration data
  • Boughorbel, S., 2010. Child-activity recognition from multi-sensor data. In: 7th International Conference on Methods...
  • da Silva, J., Klusch, M., 2007. Privacy preserving pattern discovery in distributed time series. In: Data Engineering...
  • M. Ermes et al. Detection of daily activities and sports with wearable sensors in controlled and uncontrolled conditions. Inf. Technol. Biomed., IEEE Trans. (2008)
  • Fix, E., Hodges, J.L., 1951. Discriminatory analysis: Nonparametric discrimination: Consistency properties. Tech. Rep....
  • Haapalainen, E., Laurinen, P., Junno, H., Tuovinen, L., Röning, J., 2006. Feature selection for identification of spot...
  • D.J. Hand et al. Principles of Data Mining (2001)
  • J. Hunter et al. Feature extraction from sensor data streams for real-time human behaviour recognition
  • Kätevä, J., Laurinen, P., Rautio, T., Tuovinen, L., Suutala, J., Röning, J., 2010. Dbsa – a device-based software...
  • E. Keogh et al. Finding unusual medical time-series subsequences: Algorithms and applications. Inf. Technol. Biomed., IEEE Trans. (2006)
  • Koskimäki, H., Huikari, V., Siirtola, P., Laurinen, P., Röning, J., 2009. Activity recognition using a wrist-worn...
  • N.C. Krishnan et al. Recognition of hand movements using wearable accelerometers. J. Ambient Intell. Smart Environ. (2009)
  • J. Lin et al. Experiencing SAX: A novel symbolic representation of time series. Data Min. Knowl. Disc. (2007)
  • D. Lind et al. An Introduction to Symbolic Dynamics and Coding (1995)
  • Lkhagva, B., Suzuki, Y., Kawagoe, K., 2006. New time series data representation ESAX for financial applications. In:...

Cited by (24)

    • Learning from heterogeneous temporal data in electronic health records

      2017, Journal of Biomedical Informatics
      Citation Excerpt :

      In this section, we provide the necessary background definitions and pointers to related literature. To analyze time series data, the SAX representation has been widely used on both univariate and multivariate time series, see e.g. [44–46]. In a closer relation to the problem described in this study, several studies use SAX to create features out of time series data.

    • Piecewise aggregate representations and lower-bound distance functions for multivariate time series

      2015, Physica A: Statistical Mechanics and its Applications
      Citation Excerpt :

      The symbolic conversion is indeed the algorithm of SAX. Siirtola et al. [31] proposed a new method to improve the classification accuracy streaming data based on SAX.

    • Time series visualization based on shape features

      2013, Knowledge-Based Systems
      Citation Excerpt :

      It allows time series of arbitrary length m to be reduced to a string of arbitrary length w, where w < m. In the past few years, SAX and PAA have been two of the most popular representations for time series data mining, including clustering [22], classification [12,20], pattern discovery [21,29] and visualization [4] in time series datasets. The extension of SAX (ESAX) [18] considering the minimum and maximum values of subsequence also can be applied to similarity search.

    • SDN-Based LDoS Mitigation System

      2023, 2023 International Conference on Artificial Intelligence and Computer Information Technology, AICIT 2023