Symbolic analysis of indicator time series by quantitative sequence alignment

https://doi.org/10.1016/j.csda.2008.08.033

Abstract

Symbolic analysis of economic indicators and derived time series offers the advantage of transferring quantitative values into qualitative notions by indexing intervals of numerical data with symbols. While differences in numerical indicators are routinely measured by subtraction, differences in symbolic indicators can be compared via more procedural quantitative scoring schemes, the complexity of which depends on the alphabet size. In effect, the similarity of symbolic data sequences becomes a subtle measure. After motivating the principles of symbolic analysis, we illustrate how optimized numerical scoring for alignment schemes may reveal functional and causal relations among the indicator data. The approach of symbolic analysis is particularly suitable for data processing in economics, in which partitioning of resources, competence, information access, or knowledge representation is common by methodological design.

Introduction

The symbolic analysis of time series derived from economic indicators offers a mechanism of partial noise removal by transferring the quantitative values (e.g. prices) into qualitative data via indexing a set of intervals with a set of symbols (e.g. price range labels), thereby reducing the Lebesgue measure of the original label set from a finite value to zero. Such a strategy is not uncommon; for instance, computers process continuous problems in a discrete manner through sequences represented by physical states of devices. In order to develop a meaningful symbolization filter in the form of a customizable stencil applicable to numerical indicators, it is necessary to define the similarity of symbols and the sequence information content. Such a task is facilitated by previous studies of sequence comparison in mathematics and computer science, for instance the classical longest common subsequence algorithm (Hirschberg, 1977), along with a number of variants developed in the computational analysis of hereditary information (Setubal and Meidanis, 1997).

Symbolic analysis, in the broad sense of the term, denotes the analysis of symbols, which is interdisciplinary within knowledge representation, and not attributable. Within the context of economic indicators, symbolic analysis enables gradual quantification of the degree of similarity among various time series (and their subsets) until the maximal extent of the similarity measure is reached.

The above strategy adds value to data analysis in economics, which is quite different from the lossy data compression or de-noising algorithms (Egiazarian et al., 1999) in computer science. The symbolization technique was previously motivated by complexity reduction in the parameter space of trading strategies (Schittenkopf et al., 2002); however, it is not confined to extreme event analysis (Buhlmann, 1998) or decision making. Instead, the approach is designed to reveal functional similarities and causal relations in the optimal form. Let us note here that the wavelet transform presents a powerful tool to optimize dynamic definitions of alphabets, since it efficiently implements adjustable quantization for both the time and frequency domains, and thus provides a mechanism for the accurate estimation of the complexity thresholds (Shin and Han, 2004).

Here we define the symbolic representation and similarity assessment for generic time series of real indicators. Our approach is then applied to the symbolized arrays of stock index data, using a procedure augmented with multiple scoring functions. The main strength of our approach lies in comparing arbitrary time series and searching for causality relations, based on (1) the scoring of aligned symbol tuples (similarity counting) and (2) the gap insert operations (sequence editing). The notion of a gap in the sequence edit operation admits a time delay (or speedup) in the propagation of economic effects (quantifiable by constructing the complete sequence alignment), which is the key point for applications in causality recognition.

The article is organized as follows. Section 2 outlines the two sides of symbol definition: a priori layout of the data bins for intervals in the predicted range of indicator values, vs. a posteriori partitioning of indicator histogram. The scoring functions for symbol comparison are tailored and explained with a specific viewpoint on the indicators and gap insert (time delay). Section 3 quantifies the similarity of symbol arrays, and shows actual sequence alignment obtained with the dynamic programming algorithm. Applications and benchmarks on various time series of archetype indicators and market data are examined in Section 4. The concluding remarks in Section 5 show the directions of future research towards dynamical scoring functionals, and the methods of their optimization.

Section snippets

Symbolization and scoring

This section provides an overview of the basic symbolization procedure. First, both the ex-ante and ex-post approaches to data symbolization are explained, and then a scoring scheme for symbol comparison is stated.

Let $I_t$ be the time series of an economic indicator with the range of values in real numbers between $\underline{I}$ and $\overline{I}$, admitting $\pm\infty$. The symbolic representation of such an indicator is built upon a countable set of indices $j = 1, \ldots, c$, coupled with the delimiter values $\underline{I} \le I_0 \le I_1 \le I_2 \le \cdots \le I_{c-1} \le I_c \le \overline{I}$,
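The binning described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `symbolize` and `quantile_delimiters`, the half-open interval convention (a value equal to a delimiter falls into the upper bin), the open outer bounds at $\pm\infty$, and the sample values are all assumptions made for illustration. The first function applies fixed, a priori delimiters; the second derives a posteriori delimiters from the sample so that the bins are roughly equally populated.

```python
from bisect import bisect_right
import string


def symbolize(values, delimiters, alphabet=string.ascii_uppercase):
    """Map each value to the symbol of the interval it falls in.

    `delimiters` holds the interior cut points; the outermost bins are
    assumed to extend to -inf and +inf (an illustrative choice).
    A value equal to a cut point is assigned to the upper bin.
    """
    cuts = sorted(delimiters)
    if len(cuts) + 1 > len(alphabet):
        raise ValueError("alphabet too small for the number of bins")
    return "".join(alphabet[bisect_right(cuts, v)] for v in values)


def quantile_delimiters(values, n_symbols):
    """A posteriori cut points taken from the sorted sample,
    giving approximately equally populated bins (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_symbols] for k in range(1, n_symbols)]


# Hypothetical daily returns, not data from the article.
returns = [-0.03, -0.01, 0.0, 0.004, 0.02, 0.05]

# A priori: a five-letter alphabet with hand-picked delimiters.
print(symbolize(returns, [-0.02, -0.005, 0.005, 0.02]))  # ABCCEE

# A posteriori: three bins derived from the sample itself.
print(symbolize(returns, quantile_delimiters(returns, 3)))  # AABBCC
```

With a priori delimiters the symbols keep a fixed meaning across different series, whereas the a posteriori variant adapts the alphabet to each sample's histogram.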

Similarity and sequence alignment

This section briefly summarizes the two-step dynamic programming procedure (Bellman, 1957), namely the recursive fill of a scoring matrix for substring alignments, and the reverse traceback procedure for generating the optimal alignment.
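The two steps named above, the recursive fill of a scoring matrix and the reverse traceback, can be sketched as a standard global alignment in the Needleman-Wunsch style. This is an assumption about the general technique, not a reproduction of the article's scoring functions: the function name `align`, the match/mismatch scores of +1/-1, and the linear gap penalty of -1 are hypothetical placeholders.

```python
def align(s1, s2, score=None, gap=-1):
    """Global alignment of two symbol strings by dynamic programming.

    Returns (total score, aligned s1, aligned s2), with '-' marking a gap.
    The default scoring (+1 match, -1 mismatch, -1 per gap) is illustrative.
    """
    if score is None:
        score = lambda a, b: 1 if a == b else -1
    n, m = len(s1), len(s2)

    # Step 1: fill F, where F[i][j] is the best score for s1[:i] vs s2[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),  # pair symbols
                F[i - 1][j] + gap,                              # gap in s2
                F[i][j - 1] + gap,                              # gap in s1
            )

    # Step 2: trace back from the bottom-right corner to recover one
    # optimal alignment (diagonal moves are preferred on ties).
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return F[n][m], "".join(reversed(a1)), "".join(reversed(a2))


total, top, bottom = align("ABCCE", "BCCE")
print(total)   # 3
print(top)     # ABCCE
print(bottom)  # -BCCE
```

In this toy example the single gap at the start of the second sequence can be read as a one-step time offset between the two symbol series, which is the interpretation of gap inserts given in the introduction.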

The alignment of the two symbol series is defined as two sequences $S_1[t]$ and $S_2[t]$ ($t = 1, \ldots, T$; $T \le T_1 + T_2$), where each sequence $S_i[t]$ contains the elements $s_i[t]$ in preserved order, except for the possible gap inserts. The similarity measure of both sequences is

Applications and benchmarks

The formalism of symbolic analysis is first briefly benchmarked on the noise detection and signal positioning problem, dynamics regime extraction for GARCH processes, and the correlation study for imperfect pseudo-random time series. This framework is then applied to the comparison of stock index data, in particular an a priori alignment of the BVSP index (Bovespa, Brazil) with the Dow Jones index, and an a posteriori alignment of the Tokyo Stock Exchange 225 and Dow Jones indices.

Optimization of symbol definitions

In order to test the symbolic analysis from yet another angle, we have performed a full screening over all possible definitions of the five-letter alphabet used in this work. To this aim, we have selected the first $N = 100$ elements of the Tokyo Stock Exchange Index 225 sequence starting on 01/04/1984, $\{R_i\}$, $i = 1, \ldots, N$, where $R_i$ are the log-normalized returns of the index on the daily basis. The alignment of the corresponding symbolic sequence $\{S_i\}$, $i = 1, \ldots, N$ and a shifted sequence $\{S_i\}$ have been

Concluding remarks

We have developed the symbolic analysis approach to comparing the time series of indicators. Our method allows for aligning the subsets of time series, and for rigorous quantification of the degree of similarity for such alignments. It was demonstrated that symbolic analysis is capable of cross-market similarity pattern recognition, as well as of filtering out market irregularities and analyzing time delays. Both the a priori and a posteriori approaches to the

Acknowledgments

The authors are thankful for support from the Ministry of Education, Science, Culture, Sports and Technology of Japan. We remain grateful for the help of Dr. Sarah Hallerberg, in particular for illuminating discussions on statistical pattern recognition of extreme events in time series.

