Bispectral-based methods for clustering time series

https://doi.org/10.1016/j.csda.2013.03.001

Abstract

Distinguishing among linear and nonlinear time series or between nonlinear time series generated by different underlying processes is challenging, as second-order properties are generally insufficient for the task. Different nonlinear processes have different nonconstant bispectral signatures, whereas the bispectral density function of a Gaussian or linear time series is constant. Based on this, we propose a procedure to distinguish among various nonlinear time series and between nonlinear and linear time series through application of a hierarchical clustering algorithm based on distance measures computed from the square modulus of the estimated normalized bispectra. We find that clustering using a distance measure computed by averaging the ratio of normalized bispectral periodogram ordinates over the intersection of the principal domain of each pair of time series provides good performance, subject to trimming of extreme bispectral values prior to taking the ratios. Additionally, we show through simulation studies that the distance procedure performs better than a significance test that we derive. Moreover, it is robust with respect to the choice of smoothing parameter in estimating the bispectrum. As an example, we apply the method to a set of time series of intensities of gamma-ray bursts, some of which exhibit nonlinear behavior; this enables us to identify gamma-ray bursts that may be emanating from the same type of astral event.

Introduction

Identifying time series having similar underlying properties is useful in many applications, such as financial portfolio selection, patient health monitoring, and process engineering. Series determined to be similar can then be further analyzed to understand underlying causes of the common structure and gain greater insight. Liao (2005) provides a survey of many time series clustering applications. For time series data, multiple authors have studied methods to group series having similar second-order properties using both model-based techniques and nonparametric techniques, such as those based on spectral or wavelet properties. See, for example, Frühwirth-Schnatter and Kaufmann (2008) for a discussion of model-based clustering of multiple time series and Shumway and Stoffer (2006) for a discussion of second-order spectral-based clustering of time series. However, second-order properties are generally insufficient for fully describing the dependence structure of nonlinear time series. Often third-moment properties, and in particular the bispectral density function, will contain signatures of nonlinear behavior. In this paper, we address the problem of grouping multiple independent time series, some of which are nonlinear. We propose a method for clustering series based on similarities of the square modulus of the estimated normalized bispectral density function and examine the effectiveness of the method through a simulation study. While an estimated bispectrum has been used previously for testing nonlinear serial dependence in time series data, e.g., Rusticelli et al. (2009), the estimated bispectrum has not previously been used to construct distance measures for clustering techniques. As an example, we apply the bispectral-based clustering technique to a set of gamma-ray burst (GRB) intensity time series, some of which exhibit nonlinear behavior, for the purpose of identifying bursts that may be emanating from the same type of astral event. The resulting clusters of gamma-ray bursts may provide clues for astrophysicists to understand the origins of these bursts.

The remainder of the paper is organized as follows. Section 2 provides an overview of the bispectrum of a time series. Section 3 presents a significance test for the equality of bispectra from two independent series. In Section 4, we formulate several distance measures based on the bispectra and investigate their performance in simulation studies when used for agglomerative clustering of nonlinear time series. Section 5 provides an application of the clustering method to GRB time series, while Section 6 summarizes our results and suggests some areas for further research.

Section snippets

Background on bispectral analysis

Let $\{\varepsilon_t : t \in \mathbb{Z}\}$ represent a discrete-time white noise process with variance $\sigma_\varepsilon^2$. The series $\{X_t : t \in \mathbb{Z}\}$ is a causal linear time series if it admits the representation $X_t = \sum_{j=0}^{\infty} \beta_j \varepsilon_{t-j}$ for $\sum_{j=0}^{\infty} |\beta_j| < \infty$. For any zero-mean, third-order stationary series, the autocovariance and third-order moment functions are defined as $\gamma_v = E[X_t X_{t+v}]$ and $\gamma_{u,v} = E[X_t X_{t+u} X_{t+v}]$, respectively. The series $\{X_t\}$ need be only second-order stationary with an absolutely summable autocovariance function for the spectral density function to
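As a rough illustration of the moment definitions above, the following sketch estimates $\gamma_v$ and $\gamma_{u,v}$ from simulated data using only the Python standard library. The function names `sample_autocov` and `sample_third_moment`, the Gaussian MA(1) series with coefficient 0.5, and the nonlinear series $X_t = \varepsilon_t + 0.8\,\varepsilon_{t-1}\varepsilon_{t-2}$ are illustrative assumptions and are not taken from the paper.

```python
import random


def sample_autocov(x, v):
    """Sample estimate of gamma_v = E[X_t X_{t+v}] after mean-centering, v >= 0."""
    n = len(x)
    m = sum(x) / n
    return sum((x[t] - m) * (x[t + v] - m) for t in range(n - v)) / n


def sample_third_moment(x, u, v):
    """Sample estimate of gamma_{u,v} = E[X_t X_{t+u} X_{t+v}], u, v >= 0."""
    n = len(x)
    m = sum(x) / n
    top = n - max(u, v)
    return sum((x[t] - m) * (x[t + u] - m) * (x[t + v] - m)
               for t in range(top)) / n


rng = random.Random(1)  # fixed seed so the run is reproducible
n = 20000
eps = [rng.gauss(0.0, 1.0) for _ in range(n + 2)]

# Assumed example 1: Gaussian MA(1), a causal linear series.
linear = [eps[t] + 0.5 * eps[t - 1] for t in range(2, n + 2)]

# Assumed example 2: a nonlinear series that is uncorrelated at lag 1
# but has theoretical third-order moment gamma_{1,2} = 0.8.
nonlinear = [eps[t] + 0.8 * eps[t - 1] * eps[t - 2] for t in range(2, n + 2)]

for name, x in (("linear", linear), ("nonlinear", nonlinear)):
    print(name,
          "gamma_1 =", round(sample_autocov(x, 1), 3),
          "gamma_{1,2} =", round(sample_third_moment(x, 1, 2), 3))
```

In this construction the nonlinear series has zero lag-1 autocovariance, so it looks like white noise at second order, while its third-order moment at $(u, v) = (1, 2)$ is nonzero. This is the kind of structure that second-order methods cannot detect.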

A significance test of equivalent spectra

For two independent linear processes $\{X_t\}$ and $\{Y_t\}$ with respective second-order spectra $I_X(\omega)$ and $I_Y(\omega)$, Coates and Diggle (1986) derived a test of $H_0: I_X(\omega) = I_Y(\omega)$ that is similar to the maximum periodogram test of Fisher (1929). We extend their approach to construct a significance test for $H_0: I_X(\omega_1,\omega_2) = I_Y(\omega_1,\omega_2)$, where now the processes $\{X_t\}$ and $\{Y_t\}$ are not necessarily linear.

Let the series $x_t$ and $y_t$, $t = 1, 2, \ldots, n$, be realizations of length $n$ of the processes $\{X_t\}$ and $\{Y_t\}$. Denote the estimated

Clustering algorithm

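Consider a set of $L$ time series, each being a realization of length $n$ from one of $P \le L$ different processes. For series $\ell = 1, 2, \ldots, L$, denote the estimated normalized bispectrum in (3) by $\hat{Z}_{2,\ell}^{2}(\omega_j,\omega_k)$. For all $L(L-1)/2$ pairs of series, distinguish the “first” series using the subscript $\ell_1 = 1, 2, \ldots, L-1$ and the “second” series using $\ell_2 = 2, 3, \ldots, L$, $\ell_1 < \ell_2$. The algorithm is as follows.

  1. For each series, compute $\hat{Z}_{2,\ell}^{2}(\omega_j,\omega_k)$.

  2. For all $(\omega_j,\omega_k) \in D$, compute $r_{\ell_1,\ell_2}(\omega_j,\omega_k) = \hat{Z}_{2,\ell_1}^{2}(\omega_j,\omega_k) / \hat{Z}_{2,\ell_2}^{2}(\omega_j,\omega_k)$, and $a_{\ell_1,\ell_2}(\omega_j,\omega_k) = |$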
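The snippet breaks off before the distance measure is fully defined, so the following is only a hedged sketch of the general idea: estimate a normalized bispectrum for each series, compare pairs of series through ratios of trimmed ordinates, and feed the resulting distance matrix to an agglomerative procedure. Several choices here are assumptions that do not come from the paper: the segment-averaged estimator (swapped in for the paper's smoothed estimator), the triangular frequency region standing in for $D$, clipping at the 5% and 95% quantiles as the trimming rule, the mean absolute log-ratio as the distance, average linkage, and all function names.

```python
import cmath
import math
import random


def dft(seg):
    """Plain O(M^2) discrete Fourier transform of a mean-centered segment."""
    m = len(seg)
    mean = sum(seg) / m
    c = [s - mean for s in seg]
    return [sum(c[t] * cmath.exp(-2j * math.pi * f * t / m) for t in range(m))
            for f in range(m)]


def normalized_bispectrum(x, seg_len=32):
    """Squared modulus of a segment-averaged normalized bispectrum.

    Returns {(j, k): value} on the assumed region 1 <= k <= j, j + k <= seg_len // 2.
    """
    segs = [x[s:s + seg_len] for s in range(0, len(x) - seg_len + 1, seg_len)]
    ffts = [dft(s) for s in segs]
    half = seg_len // 2
    power = [sum(abs(f[j]) ** 2 for f in ffts) / len(ffts)
             for j in range(half + 1)]
    out = {}
    for j in range(1, half):
        for k in range(1, j + 1):
            if j + k > half:
                break
            b = sum(f[j] * f[k] * f[j + k].conjugate() for f in ffts) / len(ffts)
            out[(j, k)] = abs(b) ** 2 / (power[j] * power[k] * power[j + k])
    return out


def trim(values, lo=0.05, hi=0.95):
    """Clip values to their empirical [lo, hi] quantile range (assumed trimming rule)."""
    s = sorted(values)
    a, b = s[int(lo * (len(s) - 1))], s[int(hi * (len(s) - 1))]
    return [min(max(v, a), b) for v in values]


def distance(z1, z2):
    """Mean absolute log-ratio over shared frequency pairs (assumed distance form)."""
    keys = sorted(set(z1) & set(z2))
    t1 = trim([z1[k] for k in keys])
    t2 = trim([z2[k] for k in keys])
    return sum(abs(math.log(a) - math.log(b)) for a, b in zip(t1, t2)) / len(keys)


def average_linkage(dist, n_clusters):
    """Agglomerative clustering with average linkage on a distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    return clusters


rng = random.Random(0)


def simulate(kind, n=512):
    """Assumed test processes: Gaussian MA(1) or a simple nonlinear series."""
    e = [rng.gauss(0.0, 1.0) for _ in range(n + 2)]
    if kind == "linear":
        return [e[t] + 0.5 * e[t - 1] for t in range(2, n + 2)]
    return [e[t] + 0.8 * e[t - 1] * e[t - 2] for t in range(2, n + 2)]


series = [simulate("linear") for _ in range(3)] + \
         [simulate("nonlinear") for _ in range(3)]
zs = [normalized_bispectrum(x) for x in series]
dist = [[distance(a, b) for b in zs] for a in zs]
clusters = average_linkage(dist, n_clusters=2)
print(clusters)
```

The log-ratio is used here so that the distance does not depend on which series is labelled “first”. With series this short and an estimator this crude, the two simulated groups are not guaranteed to separate cleanly; the sketch is meant to show the flow of the computation, not to reproduce the paper's results.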

An application: clustering gamma ray bursts

The following discussion on gamma ray bursts (GRBs) is a summary of resources found on-line, primarily on NASA’s web site and on a web site from Eastern Illinois University “EIU Astro” maintained by Professor James Conwell. The specific links are provided at the end of the references.

Gamma ray bursts were discovered serendipitously in the late 1960s by US military satellites watching for Soviet nuclear testing. Popular interest in them was re-ignited on December 27, 2004, when NASA and European

Comments and conclusion

Successfully clustering sets of time series is a problem of interest in many scientific areas, including but not limited to astrophysics, economics, finance, population dynamics, biology, and meteorology. A theoretical requirement for the existence of the bispectrum is that the series is at least sixth-order stationary. It is arguable that it is not possible to show that a nonlinear process is second-order stationary, much less sixth-order stationary. So there is no guarantee that

Acknowledgments

The authors would like to express their appreciation to Dr. David Kahle for his invaluable assistance in creating the improved graphs of the bispectrum using ggplot2, and to the three referees and the associate editor, whose comments greatly improved the quality of the manuscript.

References (28)

  • J. Caiado et al.

    A periodogram-based metric for time series classification

    Computational Statistics & Data Analysis

    (2006)
  • R.A. Ashley et al.

    A diagnostic test for nonlinear serial dependence in time series fitting errors

    Journal of Time Series Analysis

    (1986)
  • A. Berg et al.

    A bootstrap test for time series linearity

    Journal of Statistical Planning and Inference

    (2012)
  • D.R. Brillinger

    An introduction to polyspectra

    Annals of Mathematical Statistics

    (1965)
  • D.R. Brillinger et al.

    Asymptotic theory of estimates of k-th order spectra

  • J. Caiado et al.

    Comparison of time series of unequal length in the frequency domain

    Communications in Statistics: Simulation and Computation

    (2009)
  • D.S. Coates et al.

    Tests for comparing two estimated spectral densities

    Journal of Time Series Analysis

    (1986)
  • R.A. Fisher

    Tests of significance in harmonic analysis

    Proceedings of the Royal Society, Series A

    (1929)
  • S. Frühwirth-Schnatter et al.

    Model-based clustering of multiple time series

    Journal of Business and Economic Statistics

    (2008)
  • M. Hinich

    Testing for Gaussianity and linearity of a stationary time series

    Journal of Time Series Analysis

    (1982)
  • N. Jahan et al.

    Bispectral-based goodness-of-fit tests of Gaussianity and linearity of stationary time series

    Communications in Statistics: Theory and Methods

    (2008)
  • D. Johnson et al.

    Applied Multivariate Statistical Analysis

    (2007)
  • D.A. Jones

    Nonlinear autoregressive processes

    Proceedings of the Royal Society of London: Series A

    (1978)
  • T.W. Liao

    Clustering of time series data: a survey

    Pattern Recognition

    (2005)
Cited by (13)

    • Clustering time-series by a novel slope-based similarity measure considering particle swarm optimization

      2020, Applied Soft Computing Journal
      Citation Excerpt :

      The algorithms used for clustering can be categorized into two major groups namely evolutionary and non-evolutionary algorithms. Common non-evolutionary algorithms include but are not limited to k-means [4,11,14], I-k-means [12], k-harmonic means [47], hybrid fuzzy c-means and fuzzy c-medoids [1], kernel k-means [7], and some hierarchical methods [3,8,9]. The non-evolutionary algorithms usually provide poor results as they are dependent on the initial solution especially when dealing with high dimensional data [15].

    • Piecewise aggregate representations and lower-bound distance functions for multivariate time series

      2015, Physica A: Statistical Mechanics and its Applications
      Citation Excerpt :

      Clustering is one of the most important methods in time series data mining. In particular, hierarchical clustering [39] is often used to test the performance of feature representations and their corresponding distance functions. In this experiment, let four methods (DTW, Euclidean, MPAA and MSAX) perform on the two datasets.

    • Polarization of forecast densities: A new approach to time series classification

      2014, Computational Statistics and Data Analysis
      Citation Excerpt :

      Time series classification techniques have been applied in a wide range of fields. For applications in economics and finance, see Liu and Maharaj (2013), Salcedo et al. (2012), Maharaj and D’Urso (2010), Miskiewicz and Ausloos (2008), Ausloos and Lambiotte (2007), Dose and Cincotti (2005), Basalto et al. (2005) and Pattarin et al. (2004); for environmental applications see Macchiato et al. (1995), Cowpertwait and Cox (1992); for gene studies see Douzal-Chouakria et al. (2009), Liu et al. (2008), Park et al. (2008), Scrucca (2007), Liang (2007) and Kim et al. (2006); for applications in health sciences see Alonso et al. (2012), Slaets et al. (2012) and Volant et al. (2012); for studies in astronomy, see Harvill et al. (2013). Most approaches to time series classification are either feature-based or model-based, producing classification solutions based on historical or current information extracted from time series.
