Symbolic analysis of indicator time series by quantitative sequence alignment
Introduction
The symbolic analysis of time series derived from economic indicators offers a mechanism for partial noise removal: quantitative values (e.g. prices) are converted into qualitative data by indexing a set of intervals with a set of symbols (e.g. price range labels), thereby reducing the Lebesgue measure of the original label set from a finite value to zero. Such a strategy is not uncommon; for instance, computers process continuous problems in a discrete manner through sequences represented by the physical states of devices. To develop a meaningful symbolization filter in the form of a customizable stencil applicable to numerical indicators, one must define both the similarity of symbols and the information content of a sequence. This task is facilitated by previous studies of sequence comparison in mathematics and computer science, for instance the classical longest common subsequence algorithm (Hirschberg, 1977), along with a number of variants developed in the computational analysis of hereditary information (Setubal and Meidanis, 1997).
Symbolic analysis, in the broad sense of the term, denotes the analysis of symbols; it is an interdisciplinary topic within knowledge representation and is not attributable to any one discipline. Within the context of economic indicators, symbolic analysis enables a gradual quantification of the degree of similarity among various time series (and their subsets) until the maximum of the similarity measure is reached.
The above strategy adds value to data analysis in economics, and it is quite different from the lossy data compression or de-noising algorithms (Egiazarian et al., 1999) of computer science. The symbolization technique was previously motivated by complexity reduction in the parameter space of trading strategies (Schittenkopf et al., 2002); however, it is not confined to extreme event analysis (Buhlmann, 1998) or decision making. Instead, the approach is designed to reveal functional similarities and causal relations in the optimal form. Note that the wavelet transform is a powerful tool for optimizing dynamic definitions of alphabets, since it efficiently implements adjustable quantization in both the time and frequency domains, and thus provides a mechanism for the accurate estimation of complexity thresholds (Shin and Han, 2004).
Here we define the symbolic representation and similarity assessment for generic time series of real indicators. Our approach is then applied to symbolized arrays of stock index data, using a procedure augmented with multiple scoring functions. The main strength of our approach lies in comparing arbitrary time series and searching for causality relations, based on (1) the scoring of aligned symbol tuples (similarity counting) and (2) gap insert operations (sequence editing). Allowing gaps in the sequence edit operation admits a time delay (or speedup) in the propagation of economic effects, quantifiable by constructing the complete sequence alignment; this is the key point for applications in causality recognition.
The article is organized as follows. Section 2 outlines the two sides of symbol definition: the a priori layout of data bins for intervals in the predicted range of indicator values, versus the a posteriori partitioning of the indicator histogram. The scoring functions for symbol comparison are tailored and explained with a specific view to the indicators and the gap insert (time delay). Section 3 quantifies the similarity of symbol arrays and shows an actual sequence alignment obtained with the dynamic programming algorithm. Applications and benchmarks on various time series of archetype indicators and market data are examined in Section 4. The concluding remarks in Section 5 point to directions of future research towards dynamical scoring functionals and the methods of their optimization.
Section snippets
Symbolization and scoring
This section provides an overview of the basic symbolization procedure. First, both the ex-ante and ex-post approaches to data symbolization are explained, and then a scoring scheme for symbol comparison is stated.
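The two symbolization routes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the five-letter alphabet "ABCDE", the example delimiter values, and the use of equal-probability quantiles for the ex-post partition are all assumptions made for illustration.

```python
import numpy as np

def symbolize_a_priori(values, delimiters, alphabet="ABCDE"):
    """Map each value to a symbol using fixed delimiters chosen in advance.

    Assumption: len(alphabet) - 1 increasing delimiters split the real line
    into len(alphabet) intervals, labelled in order.
    """
    delimiters = np.asarray(delimiters, dtype=float)
    if len(delimiters) != len(alphabet) - 1:
        raise ValueError("need exactly len(alphabet) - 1 delimiters")
    # np.digitize returns, for each value, the index of its interval
    idx = np.digitize(values, delimiters)
    return "".join(alphabet[i] for i in idx)

def symbolize_a_posteriori(values, alphabet="ABCDE"):
    """Derive delimiters from the observed data, then symbolize.

    Assumption: the histogram is partitioned at equal-probability quantiles,
    which is only one of many possible ex-post choices.
    """
    k = len(alphabet)
    probs = np.arange(1, k) / k
    delimiters = np.quantile(values, probs)
    return symbolize_a_priori(values, delimiters, alphabet), delimiters

# Hypothetical daily returns, for illustration only
returns = np.array([-0.031, -0.004, 0.002, 0.012, 0.040, -0.015, 0.000, 0.007])
print(symbolize_a_priori(returns, [-0.02, -0.005, 0.005, 0.02]))
```

With fixed delimiters the same value always maps to the same symbol across data sets, whereas the quantile-based variant adapts the intervals to each sample so that every symbol is used about equally often.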
Let be the time series of economic indicator with the range of values in real numbers between and , admitting . The symbolic representation of such indicator is built upon a countable set of indices , coupled with the delimiter values
Similarity and sequence alignment
This section briefly summarizes the two-step dynamic programming procedure (Bellman, 1957), namely the recursive fill of a scoring matrix for substring alignments, and the reverse traceback procedure for generating the optimal alignment.
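The two steps can be sketched in Python as a standard global alignment with a linear gap penalty. This is an illustrative sketch under stated assumptions, not the scoring scheme of the article: the pair score (2 for identical symbols, 1 for neighbouring symbols, -1 otherwise), the gap penalty of -2, and the names `pair_score` and `align` are all hypothetical.

```python
def pair_score(a, b, alphabet="ABCDE"):
    """Hypothetical symbol score: 2 if identical, 1 if adjacent, -1 otherwise."""
    d = abs(alphabet.index(a) - alphabet.index(b))
    return 2 if d == 0 else (1 if d == 1 else -1)

def align(s, t, gap=-2, score=pair_score):
    """Global alignment of symbol strings s and t; returns (score, s', t')."""
    n, m = len(s), len(t)
    # Step 1: fill the scoring matrix F, where F[i][j] is the best score
    # for aligning the prefixes s[:i] and t[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # align two symbols
                F[i - 1][j] + gap,                            # gap in t
                F[i][j - 1] + gap,                            # gap in s
            )
    # Step 2: trace back from the bottom-right corner to recover one
    # optimal alignment, writing "-" for each inserted gap.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F[n][m], "".join(reversed(a)), "".join(reversed(b))

print(align("ACCDE", "CCDE"))
```

In this toy example the second sequence matches the first from its second symbol onward, so the optimal alignment places a single gap at the start; in the economic reading, such a gap corresponds to a one-step time delay between the two indicators.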
The alignment of the two symbol series is defined as two sequences and (), where each sequence contains the elements in preserved order, except for the possible gap inserts. The similarity measure of both sequences is
Applications and benchmarks
The formalism of symbolic analysis is first briefly benchmarked on the noise detection and signal positioning problem, on dynamics regime extraction for GARCH processes, and on a correlation study of imperfect pseudo-random time series. The framework is then applied to the comparison of stock index data, in particular an a priori alignment of the BVSP index (Bovespa, Brazil) with the Dow Jones index, and an a posteriori alignment of the Tokyo Stock Exchange 225 and Dow Jones indices.
Optimization of symbol definitions
In order to test the symbolic analysis from yet another angle, we have performed a full screening over all possible definitions of the five-letter alphabet used in this work. To this aim, we have selected the first elements of the Tokyo Stock Exchange Index 225 sequence starting on 01/04/1984, , , where are the log-normalized returns of the index on the daily basis. The alignment of the corresponding symbolic sequence , and a shifted sequence have been
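A screening of this kind can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the article: candidate delimiters are drawn from a finite grid, every choice of four increasing delimiters defines one five-letter alphabet, and the alignment score is replaced by a much simpler gap-free measure (the fraction of positions at which the symbolic sequence coincides with its own shifted copy). The names `screen_alphabets` and `shifted_identity` are hypothetical.

```python
from itertools import combinations
import numpy as np

def shifted_identity(symbols, shift):
    """Fraction of positions where the sequence equals its shifted copy."""
    if shift < 1:
        raise ValueError("shift must be at least 1")
    return float(np.mean(symbols[:-shift] == symbols[shift:]))

def screen_alphabets(returns, grid, shift=1):
    """Score every five-letter alphabet definable on the delimiter grid.

    Each combination of four increasing grid values splits the real line
    into five intervals, i.e. one alphabet definition.
    """
    results = []
    for delimiters in combinations(sorted(set(grid)), 4):
        symbols = np.digitize(returns, delimiters)  # integer symbols 0..4
        results.append((shifted_identity(symbols, shift), delimiters))
    # Highest-scoring alphabet definitions first
    return sorted(results, key=lambda r: r[0], reverse=True)

# Synthetic returns, for illustration only (not market data)
rng = np.random.default_rng(0)
synthetic_returns = rng.normal(0.0, 0.01, size=200)
grid = [-0.02, -0.01, -0.005, 0.005, 0.01, 0.02]
ranking = screen_alphabets(synthetic_returns, grid, shift=1)
print(len(ranking), ranking[0])
```

Note that a plain identity fraction rewards degenerate alphabets in which almost all values fall into a single interval, so a practical screening would need a score that also accounts for the information content of the symbols.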
Concluding remarks
We have developed a symbolic analysis approach for comparing time series of indicators. Our method allows subsets of time series to be aligned, and the degree of similarity of such alignments to be rigorously quantified. We demonstrated that symbolic analysis is capable of recognizing cross-market similarity patterns, filtering out market irregularities, and analyzing time delays. Both the a priori and a posteriori approaches to the
Acknowledgments
The authors are grateful for support from the Ministry of Education, Science, Culture, Sports and Technology of Japan. We also thank Dr. Sarah Hallerberg for her help, in particular for illuminating discussions on the statistical pattern recognition of extreme events in time series.
References (14)
Generalised autoregressive conditional heteroskedasticity. Journal of Econometrics (1986).
A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology (1970).
Identification of common molecular subsequences. Journal of Molecular Biology (1981).
Bellman. Dynamic Programming (1957).
Buhlmann. Extreme events from return-volume process: A discretization approach for complexity reduction. Applied Financial Economics (1998).
Egiazarian et al. Adaptive denoising and lossy compression of images in transform domain. Journal of Electronic Imaging (1999).
Hirschberg. Algorithms for the longest common subsequence problem. Journal of the ACM (1977).
- 1
On visiting research affiliation.