Bispectral-based methods for clustering time series

doi:10.1016/j.csda.2013.03.001

Computational Statistics & Data Analysis

Volume 64, August 2013, Pages 113-131

https://doi.org/10.1016/j.csda.2013.03.001

Abstract

Distinguishing among linear and nonlinear time series or between nonlinear time series generated by different underlying processes is challenging, as second-order properties are generally insufficient for the task. Different nonlinear processes have different nonconstant bispectral signatures, whereas the bispectral density function of a Gaussian or linear time series is constant. Based on this, we propose a procedure to distinguish among various nonlinear time series and between nonlinear and linear time series through application of a hierarchical clustering algorithm based on distance measures computed from the square modulus of the estimated normalized bispectra. We find that clustering using a distance measure computed by averaging the ratio of normalized bispectral periodogram ordinates over the intersection of the principal domain of each pair of time series provides good performance, subject to trimming of extreme bispectral values prior to taking the ratios. Additionally, we show through simulation studies that the distance procedure performs better than a significance test that we derive. Moreover, it is robust with respect to the choice of smoothing parameter in estimating the bispectrum. As an example, we apply the method to a set of time series of intensities of gamma-ray bursts, some of which exhibit nonlinear behavior; this enables us to identify gamma-ray bursts that may be emanating from the same type of astral event.

Introduction

Identifying time series having similar underlying properties is useful in many applications, such as financial portfolio selection, patient health monitoring, and process engineering. Series determined to be similar can then be further analyzed to understand underlying causes of the common structure and gain greater insight. Liao (2005) provides a survey of many time series clustering applications. For time series data, multiple authors have studied methods to group series having similar second-order properties using both model-based techniques and nonparametric techniques, such as those based on spectral or wavelet properties. See, for example, Frühwirth-Schnatter and Kaufmann (2008) for a discussion of model-based clustering of multiple time series and Shumway and Stoffer (2006) for a discussion of second-order spectral-based clustering of time series. However, second-order properties are generally insufficient for fully describing the dependence structure of nonlinear time series. Often third-moment properties, and in particular the bispectral density function, will contain signatures of nonlinear behavior. In this paper, we address the problem of grouping multiple independent time series, some of which are nonlinear. We propose a method for clustering series based on similarities of the square modulus of the estimated normalized bispectral density function and examine the effectiveness of the method through a simulation study. While an estimated bispectrum has been used previously for testing nonlinear serial dependence in time series data, e.g., Rusticelli et al. (2009), the estimated bispectrum has not previously been used to construct distance measures for clustering techniques. As an example, we apply the bispectral-based clustering technique to a set of gamma-ray burst (GRB) intensity time series, some of which exhibit nonlinear behavior, for the purpose of identifying bursts that may be emanating from the same type of astral event. The resulting clusters of gamma-ray bursts may provide clues for astrophysicists to understand the origins of these bursts.

The remainder of the paper is organized as follows. Section 2 provides an overview of the bispectrum of a time series. Section 3 presents a significance test for the equality of bispectra from two independent series. In Section 4, we formulate several distance measures based on the bispectra and investigate their performance in simulation studies when used for agglomerative clustering of nonlinear time series. Section 5 provides an application of the clustering method to GRB time series, while Section 6 summarizes our results and suggests some areas for further research.

Section snippets

Background on bispectral analysis

Let $\{ε_{t} : t \in \mathbb{Z}\}$ represent a discrete-time white noise process with variance $σ_{ε}^{2}$. The series $\{X_{t} : t \in \mathbb{Z}\}$ is a causal linear time series if it admits the representation $X_{t} = \sum_{j = 0}^{\infty} β_{j} ε_{t - j}$ for $\sum_{j = 0}^{\infty} | β_{j} | < \infty$. For any zero-mean, third-order stationary series, the autocovariance and third-order moment functions are defined as $γ_{v} = E [X_{t} X_{t + v}]$ and $γ_{u, v} = E [X_{t} X_{t + u} X_{t + v}]$, respectively. The series $\{X_{t}\}$ need be only second-order stationary with an absolutely summable autocovariance function for the spectral density function to
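The definitions in this snippet can be illustrated with a minimal Python sketch. It simulates a causal linear series from a finite coefficient list (which is trivially absolutely summable) and computes sample analogues of $γ_{v}$ and $γ_{u, v}$. The function names, the Gaussian choice of white noise, and the divisor $n$ in the sample moments are assumptions made for illustration; none of them is taken from the paper.

```python
import random


def simulate_linear_series(beta, n, seed=0):
    """Simulate X_t = sum_j beta_j * eps_{t-j} for a finite coefficient list beta."""
    rng = random.Random(seed)
    q = len(beta) - 1
    # Gaussian white noise (an assumption); q extra draws supply the pre-sample values.
    eps = [rng.gauss(0.0, 1.0) for _ in range(n + q)]
    return [sum(b * eps[t + q - j] for j, b in enumerate(beta)) for t in range(n)]


def _centered(x):
    """Subtract the sample mean so the zero-mean definitions apply."""
    mean = sum(x) / len(x)
    return [xi - mean for xi in x]


def sample_autocovariance(x, v):
    """Sample analogue of gamma_v = E[X_t X_{t+v}] for a lag v >= 0."""
    z = _centered(x)
    n = len(z)
    return sum(z[t] * z[t + v] for t in range(n - v)) / n


def sample_third_moment(x, u, v):
    """Sample analogue of gamma_{u,v} = E[X_t X_{t+u} X_{t+v}] for lags u, v >= 0."""
    z = _centered(x)
    n = len(z)
    m = max(u, v)
    return sum(z[t] * z[t + u] * z[t + v] for t in range(n - m)) / n
```

For a linear series driven by symmetric noise, the sample third-order moments should be close to zero at every lag pair, whereas a nonlinear series can show clearly nonzero values; this is the kind of third-moment signature the bispectrum summarizes in the frequency domain.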

A significance test of equivalent spectra

For two independent linear processes $\{X_{t}\}$ and $\{Y_{t}\}$ with respective second-order spectra $I_{X} (ω)$ and $I_{Y} (ω)$, Coates and Diggle (1986) derived a test of $H_{0} : I_{X} (ω) = I_{Y} (ω)$ that is similar to the maximum periodogram test of Fisher (1929). We extend their approach to construct a significance test for $H_{0} : I_{X} (ω_{1}, ω_{2}) = I_{Y} (ω_{1}, ω_{2})$, where now the processes $\{X_{t}\}$ and $\{Y_{t}\}$ are not necessarily linear.

Let the series $x_{t}$ and $y_{t}$, $t = 1, 2, \dots, n$, be realizations of length $n$ of the processes $\{X_{t}\}$ and $\{Y_{t}\}$. Denote the estimated

Clustering algorithm

Consider a set of $L$ time series, each being a realization of length $n$ from one of $P \leq L$ different processes. For series $ℓ = 1, 2, \dots, L$ , denote the estimated normalized bispectrum in (3) by ${\hat{Z}}_{2, ℓ}^{2} (ω_{j}, ω_{k})$ . For all $L (L - 1) / 2$ pairs of series, distinguish the “first” series using the subscript $ℓ_{1} = 1, 2, \dots, L - 1$ and the “second” series using $ℓ_{2} = 2, 3, \dots, L, ℓ_{1} \neq ℓ_{2}$ . The algorithm is as follows.

1. For each series, compute ${\hat{Z}}_{2, ℓ}^{2} (ω_{j}, ω_{k})$.
2. For all $(ω_{j}, ω_{k}) \in D$, compute $r_{ℓ_{1}, ℓ_{2}} (ω_{j}, ω_{k}) = \frac{{\hat{Z}}_{2, ℓ_{1}}^{2} (ω_{j}, ω_{k})}{{\hat{Z}}_{2, ℓ_{2}}^{2} (ω_{j}, ω_{k})}$ and $a_{ℓ_{1}, ℓ_{2}} (ω_{j}, ω_{k}) = |$
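The general shape of a ratio-based procedure like this can be sketched in Python under several loudly stated assumptions. The snippet breaks off before $a_{ℓ_{1}, ℓ_{2}}$ is defined, so the sketch substitutes the absolute log-ratio $|\log r|$ as a symmetric stand-in; that is not claimed to be the paper's measure. The quantile-based trimming rule, the average linkage, and every function name are likewise hypothetical. Each series is represented by a dict mapping a frequency index pair `(j, k)` to a positive normalized bispectrum estimate, so the intersection of two domains is simply the set of shared keys.

```python
import math


def trim_bounds(values, lower=0.05, upper=0.95):
    """Crude empirical quantile bounds used to flag extreme ordinates (assumed rule)."""
    s = sorted(values)
    return s[int(lower * (len(s) - 1))], s[int(upper * (len(s) - 1))]


def pair_distance(z1, z2, lower=0.05, upper=0.95):
    """Average |log ratio| over shared ordinates that survive trimming in both series."""
    lo1, hi1 = trim_bounds(z1.values(), lower, upper)
    lo2, hi2 = trim_bounds(z2.values(), lower, upper)
    shared = [p for p in z1
              if p in z2 and lo1 <= z1[p] <= hi1 and lo2 <= z2[p] <= hi2]
    if not shared:
        raise ValueError("no ordinates left after trimming")
    # |log r| is a stand-in: it is zero for equal ordinates and symmetric in the pair.
    return sum(abs(math.log(z1[p] / z2[p])) for p in shared) / len(shared)


def distance_matrix(spectra):
    """Symmetric matrix of pairwise distances for all L(L - 1)/2 pairs."""
    size = len(spectra)
    d = [[0.0] * size for _ in range(size)]
    for a in range(size - 1):
        for b in range(a + 1, size):
            d[a][b] = d[b][a] = pair_distance(spectra[a], spectra[b])
    return d


def agglomerate(d, n_clusters):
    """Naive average-linkage agglomerative clustering on a distance matrix."""
    clusters = [[i] for i in range(len(d))]

    def linkage(c1, c2):
        return sum(d[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters; ties go to the first pair found.
        a, b = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

With two nearly flat inputs (as expected for linear series) and two inputs that vary over the frequency grid, `agglomerate(distance_matrix(spectra), 2)` should separate the flat pair from the varying pair. Taking logs before averaging keeps the distance from depending on which series of a pair is labeled "first", which a raw ratio would not.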

An application: clustering gamma ray bursts

The following discussion on gamma ray bursts (GRBs) is a summary of resources found on-line, primarily on NASA’s web site and on a web site from Eastern Illinois University “EIU Astro” maintained by Professor James Conwell. The specific links are provided at the end of the references.

Gamma ray bursts were discovered serendipitously in the late 1960s by US military satellites watching for Soviet nuclear testing. Popular interest in them was re-ignited on December 27, 2004, when NASA and European

Comments and conclusion

Successfully clustering sets of time series is a problem of interest in many scientific areas, including but not limited to astrophysics, economics, finance, population dynamics, biology, and meteorology. A theoretical requirement for the existence of the bispectrum is that the series is at least sixth-order stationary. It is arguable that it is not possible to show that a nonlinear process is second-order stationary, much less sixth-order stationary. So there is no guarantee that

Acknowledgments

The authors would like to express their appreciation to Dr. David Kahle for his invaluable assistance in creating the improved graphs of the bispectrum using ggplot2, and to the three referees and the associate editor, whose comments greatly improved the quality of the manuscript.

References (28)

J. Caiado et al.
A periodogram-based metric for time series classification
Computational Statistics & Data Analysis
(2006)

R.A. Ashley et al.
A diagnostic test for nonlinear serial dependence in time series fitting errors
Journal of Time Series Analysis
(1986)

A. Berg et al.
A bootstrap test for time series linearity
Journal of Statistical Planning and Inference
(2012)

D.R. Brillinger
An introduction to polyspectra
Annals of Mathematical Statistics
(1965)

D.R. Brillinger et al.
Asymptotic theory of estimates of $k$-th order spectra

J. Caiado et al.
Comparison of time series of unequal length in the frequency domain
Communications in Statistics: Simulation and Computation
(2009)

D.S. Coates et al.
Tests for comparing two estimated spectral densities
Journal of Time Series Analysis
(1986)

R.A. Fisher
Tests of significance in harmonic analysis
Proceedings of the Royal Society, Series A
(1929)

S. Frühwirth-Schnatter et al.
Model-based clustering of multiple time series
Journal of Business and Economic Statistics
(2008)

M. Hinich
Testing for Gaussianity and linearity of a stationary time series
Journal of Time Series Analysis
(1982)

N. Jahan et al.
Bispectral-based goodness-of-fit tests of Gaussianity and linearity of stationary time series
Communications in Statistics: Theory and Methods
(2008)

D. Johnson et al.
Applied Multivariate Statistical Analysis
(2007)

D.A. Jones
Nonlinear autoregressive processes
Proceedings of the Royal Society of London: Series A
(1978)

T.W. Liao
Clustering of time series data: a survey
Pattern Recognition
(2005)

