A Parallel Approach to Discords Discovery in Massive Time Series Data

A discord is a refinement of the concept of an anomalous subsequence of a time series. Being one of the topical issues of time series mining, discords discovery is applied in a wide range of real-world areas (medicine, astronomy, economics, climate modeling, predictive maintenance, energy consumption, etc.). In this article, we propose a novel parallel algorithm for discords discovery on a high-performance cluster with nodes based on many-core accelerators, for the case when the time series cannot fit in main memory. We assume that the time series is partitioned across the cluster nodes, and we achieve parallelization both among the cluster nodes and within a single node. Within a cluster node, the algorithm employs a set of matrix data structures to store and index the subsequences of the time series and to provide efficient vectorization of computations on the accelerator. Each node processes its own partition in two phases, namely candidate selection and discord refinement, with each phase requiring one linear scan through the partition. The local discords found are then combined into the global candidate set, which is transmitted to every cluster node. Next, each node refines the global candidate set over its own partition, resulting in a local true discord set. Finally, the global true discord set is constructed as the intersection of the local true discord sets. The experimental evaluation on a real computer cluster with real and synthetic time series shows the high scalability of the proposed algorithm.


Introduction
Currently, the discovery of anomalous subsequences in very long time series is a topical issue in a wide spectrum of real-world applications, namely medicine, astronomy, economics, climate modeling, predictive maintenance, energy consumption, and others. Such applications often have to deal with multi-terabyte time series, which cannot fit into main memory.
The discord concept and the HOTSAX discovery algorithm were introduced by Keogh et al. [1]; compared with rival anomaly detection algorithms, which typically require many parameters [2], discord discovery needs far fewer. HOTSAX, however, assumes that the time series resides in main memory.
Further, Yankov, Keogh et al. proposed a disk-aware algorithm (for brevity, referred to as DADD, Disk-Aware Discord Discovery) based on the range discord concept [3]. For a given range r, DADD finds all discords at a distance of at least r from their nearest neighbor. The algorithm requires two linear scans through the time series on disk. Later, Yankov, Keogh et al. [4] discussed a parallelization of DADD based on the MapReduce paradigm. However, in the experimental evaluation, the authors only simulated this parallel implementation on up to eight computers.
Our research is devoted to parallel and distributed algorithms for time series mining. In previous work [5], we parallelized HOTSAX for many-core accelerators. This article continues our study and contributes as follows. We present a parallel implementation of DADD on a high-performance cluster with nodes based on many-core accelerators. The original algorithm is extended with a set of index structures that provide efficient vectorization of computations on each cluster node. We carried out experiments on a real computer cluster with real and synthetic time series, which showed the high scalability of our approach.
The rest of the article is organized as follows. In Section 2, we give the formal definitions along with a brief description of DADD. Section 3 provides a brief review of related work. Section 4 presents the proposed approach. In Section 5, the results of the experimental evaluation of our approach are provided. Finally, Section 6 summarizes the results obtained and suggests directions for further research.

Notations and Definitions
Below, we follow [4] to give some formal definitions and the statement of the problem.
A time series $T$ is a sequence of real-valued elements: $T = (t_1, \ldots, t_m)$, $t_i \in \mathbb{R}$. The length of a time series is denoted by $|T| = m$.
A subsequence $T_{i,n}$ of a time series $T$ is its contiguous subset of $n$ elements starting at position $i$: $T_{i,n} = (t_i, \ldots, t_{i+n-1})$, where $1 \le n \ll m$ and $1 \le i \le m - n + 1$. We denote the set of all subsequences of length $n$ in $T$ by $S_T^n$. Let $N$ denote the number of subsequences in $S_T^n$, i.e., $N = |S_T^n| = m - n + 1$.
A distance function for any two subsequences is a nonnegative and symmetric function $Dist: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$.
Two subsequences $T_{i,n}, T_{j,n} \in S_T^n$ are non-trivial matches [6] with respect to a distance function $Dist$ if $\exists\, T_{p,n} \in S_T^n$, $i < p < j$, such that $Dist(T_{i,n}, T_{j,n}) < Dist(T_{i,n}, T_{p,n})$. Let us denote a non-trivial match of a subsequence $C \in S_T^n$ by $M_C$. A subsequence $D \in S_T^n$ is said to be the most significant discord in $T$ if the distance to its nearest non-trivial match is the largest, that is, $\forall C \in S_T^n: \min(Dist(D, M_D)) \ge \min(Dist(C, M_C))$. A subsequence $D \in S_T^n$ is called the most significant $k$-th discord in $T$ if the distance to its $k$-th nearest non-trivial match is the largest.
Given a positive parameter $r$, a discord at a distance of at least $r$ from its nearest neighbor is called a range discord, i.e., for a discord $D$, $\min(Dist(D, M_D)) \ge r$.
DADD, the original serial disk-based algorithm [4], addresses the discovery of range discords and provides researchers with a procedure for choosing the parameter $r$. To accelerate this procedure, our parallel algorithm [5] for many-core accelerators can be applied; it discovers discords for the case when the time series fits in main memory.
When computing the distance between subsequences, DADD demands that the arguments have been previously z-normalized to have mean zero and standard deviation one. Here, the z-normalization of a subsequence $C \in S_T^n$ is defined as a subsequence $\hat{C} = (\hat{c}_1, \ldots, \hat{c}_n)$ in which $\hat{c}_i = \frac{c_i - \mu}{\sigma}$, where $\mu = \frac{1}{n}\sum_{i=1}^{n} c_i$ and $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} c_i^2 - \mu^2}$. In the original DADD algorithm, the Euclidean distance is used as the distance measure, yet the algorithm can be used with any distance function, which need not necessarily be a metric [4]. Given two subsequences $X, Y \in S_T^n$, the Euclidean distance between them is calculated as $ED(X, Y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$.
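To make the definitions above concrete, here is a minimal Python sketch of z-normalization and the Euclidean distance (the `eps` guard against constant subsequences is our own addition, not part of the original algorithm):

```python
import numpy as np

def z_normalize(c, eps=1e-12):
    """Shift and scale a subsequence to zero mean and unit standard deviation."""
    c = np.asarray(c, dtype=float)
    return (c - c.mean()) / max(c.std(), eps)

def euclidean(x, y):
    """Euclidean distance between two subsequences of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))
```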

The Original Algorithm
The DADD algorithm [4] performs in two phases, namely candidate selection and discord refinement, with each phase requiring one linear scan through the time series on disk. Algorithm 1 depicts the pseudo-code of DADD (up to the replacement of the Euclidean distance by an arbitrary distance function). The algorithm takes a time series $T$ and a range $r$ as input and outputs the set of discords $C$. For a discord $c \in C$, we denote the distance to its nearest neighbor by $c.dist$.
In the first phase, the algorithm scans through the time series $T$ and, for each subsequence $s \in S_T^n$, checks whether each candidate $c$ already in the set $C$ can still be a discord. If a candidate $c$ fails the validation, it is removed from the set. In the end, the new $s$ is either added to the candidate set, if it is likely to be a discord, or discarded. The correctness of this procedure is proved in Yankov et al. [4].
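The selection logic can be sketched as follows. This is a simplified in-memory illustration rather than the authors' disk-based pseudo-code: `dist` is assumed to take already z-normalized subsequences, and overlapping subsequences are treated as trivial matches and skipped.

```python
def select_candidates(subsequences, r, dist):
    """One linear scan: maintain the indices C of possible range discords.

    A new subsequence s evicts every non-overlapping candidate lying
    within r of it (that candidate has a neighbor closer than r), and
    s itself is kept only if no such candidate exists."""
    n = len(subsequences[0]) if subsequences else 0
    C = []
    for j, s in enumerate(subsequences):
        keep_s = True
        survivors = []
        for i in C:
            if abs(i - j) >= n and dist(subsequences[i], s) < r:
                keep_s = False   # both candidate i and s have a near neighbor
            else:
                survivors.append(i)
        C = survivors
        if keep_s:
            C.append(j)
    return C
```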
In the second phase, the algorithm initially sets the distances of all candidates to their nearest neighbors to infinity. Then, the algorithm scans through the time series $T$, calculating the distance between each subsequence $s \in S_T^n$ and each candidate $c$. Here, when calculating $ED(s, c)$, the EarlyAbandonED procedure stops the summation of $\sum_{k=1}^{n}(s_k - c_k)^2$ if it reaches $k = \ell$, $1 \le \ell \le n$, for which $\sum_{k=1}^{\ell}(s_k - c_k)^2 \ge r^2$. If the distance is less than $r$, the candidate is a false positive and is permanently removed from $C$. If the distance is less than the current value of $c.dist$ (and still not less than $r$, otherwise the candidate would have been removed), the current distance to the nearest neighbor is updated.
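The early-abandoning computation of the squared distance can be sketched as follows (a plain-Python illustration, assuming both subsequences are already z-normalized):

```python
def early_abandon_sqdist(s, c, r):
    """Accumulate (s_k - c_k)^2 term by term and stop as soon as the
    partial sum reaches r^2: the caller only needs to know whether the
    true distance is below r, not its exact value."""
    limit = r * r
    acc = 0.0
    for sk, ck in zip(s, c):
        acc += (sk - ck) ** 2
        if acc >= limit:
            break
    return acc
```

A return value of at least $r^2$ guarantees $ED(s, c) \ge r$, so the candidate survives this comparison without the remaining terms ever being computed.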

Related Work
Introduced by Keogh et al. [1], time series discords are currently considered one of the best techniques for time series anomaly detection [7].
The original HOTSAX algorithm [1] is based on the SAX (Symbolic Aggregate ApproXimation) transformation [8]. Among the improvements of HOTSAX, we can mention the following algorithms: iSAX [9] and HOT-iSAX [10] (indexable SAX), WAT [11] (Haar wavelets instead of SAX), HashDD [12] (a hash table instead of the prefix trie), HDD-MBR [13] (application of R-trees), and BitClusterDiscord [14] (clustering of the bit representation of subsequences). However, the above-mentioned algorithms are able to discover discords only if the time series fits in main memory and, to the best of our knowledge, have no parallel implementations.
Further, Yankov, Keogh et al. [3] overcame the main memory size limitation having proposed a diskaware discord discovery algorithm (DADD) based on the range discord concept. For a given range r, DADD finds all discords at a distance of at least r from their nearest neighbor. The algorithm performs in two phases, namely the candidate selection and discord refinement, with each phase requiring one linear scan through the time series on the disk.
There are a couple of noteworthy works devoted to the parallelization of DADD. The DDD (Distributed Discord Discovery) algorithm [15] parallelizes DADD through a Spark cluster [16] and HDFS (Hadoop Distributed File System) [17]. DDD distributes the time series over the HDFS cluster and handles each partition in the memory of a computing node. As opposed to DADD, DDD computes distances without taking advantage of an upper bound for early abandoning, which would increase the algorithm's performance.
The PDD (Parallel Discord Discovery) algorithm [18] also utilizes a Spark cluster but transmits a subsequence and its non-trivial matches to one or more computing nodes to calculate the distances between them. A batch of contiguous subsequences is transmitted and processed together to reduce the message-passing overhead. Nevertheless, PDD is not scalable, since intensive message passing between the cluster nodes leads to a significant degradation of the algorithm's performance as the number of nodes increases.
In their further work [4], Yankov, Keogh et al. discussed a parallel version of DADD based on the MapReduce paradigm (hereinafter referred to as MR-DADD); the basic idea is as follows. Let the input time series $T$ be partitioned evenly across $P$ cluster nodes. Each node performs the selection phase on its own partition with the same parameter $r$ and produces a distinct candidate set $C_i$. Then the combined candidate set $C_P = \bigcup_{i=1}^{P} C_i$ is constructed and transmitted to each cluster node. Next, each node performs the refinement phase on its own partition, taking $C_P$ as input, and produces a refined candidate set $\tilde{C}_i$. The final discords are given by the set $\bigcap_{i=1}^{P} \tilde{C}_i$. In the experimental evaluation, however, the authors only simulated this scheme on up to eight computers, obtaining a near-to-linear speedup.
Concluding this brief review, we should also mention the matrix profile (MP) concept proposed by Keogh et al. [19]. The MP is a data structure that annotates a time series and can be applied to solve an impressively large list of time series mining problems, including discords discovery, but at a computational cost of $O(m^2)$, where $m$ is the time series length [19,20]. Recent parallel algorithms for MP computation include GPU-STAMP [19] and MP-HPC [21], which are implementations for graphics processors through the CUDA (Compute Unified Device Architecture) technology and for computer clusters through the MPI (Message Passing Interface) technology, respectively.

Discords Discovery on Computer Cluster with Many-Core Accelerators
The parallelization employs two-level parallelism, namely across the cluster nodes and among the threads of a single node. We implement the former level through partitioning of the input time series combined with the MPI technology, and the latter through the OpenMP technology. Within a single node, we employ a matrix representation of the data to effectively parallelize computations through OpenMP. Below, we show an approach to the implementation of these ideas.

Time Series Representation
To provide parallelism at the level of the cluster nodes, we partition the time series across the nodes as follows. Let $P$ be the number of nodes in the cluster; then the $k$-th partition ($0 \le k \le P - 1$) of the time series is defined as $T_{start,len}$, where $start = 1$ for $k = 0$ and $start = k \cdot \lceil m/P \rceil - n + 2$ for $k > 0$, while $len = \lceil m/P \rceil$ for $k = 0$, $len = \lceil m/P \rceil + n - 1$ for $0 < k < P - 1$, and $len = m - start + 1$ for $k = P - 1$. This means that the head part of every partition except the first overlaps with the tail part of the previous partition in $n - 1$ data points. This technique prevents the loss of subsequences at the junctions of two neighboring partitions. To simplify the presentation of the algorithm, hereinafter in this section we use the symbol $T$ and the related notions introduced above to denote a partition on the current node rather than the whole input time series.
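Under the assumption of 0-based positions and a base partition length of $\lceil m/P \rceil$, the overlapped partitioning can be sketched as:

```python
import math

def partitions(m, P, n):
    """Return (start, length) pairs for P partitions of a series of length m.

    Every partition except the first also covers the last n-1 points of
    its predecessor, so no subsequence of length n is lost at a junction."""
    base = math.ceil(m / P)
    parts = []
    for k in range(P):
        start = k * base - (n - 1 if k > 0 else 0)
        end = min((k + 1) * base, m)
        parts.append((start, end - start))
    return parts
```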
The time series partition is stored as a matrix of aligned subsequences to enable computations over aligned data with as many auto-vectorizable loops as possible. We avoid unaligned memory access, since it can cause inefficient vectorization due to the time overhead of loop peeling [22].
Let us denote the number of floats stored in the VPU (vector processing unit) of the many-core accelerator by $width_{VPU}$. If the discord length $n$ is not a multiple of $width_{VPU}$, then each subsequence is padded with zeros, where the number of zeros is calculated as $pad = width_{VPU} - (n \bmod width_{VPU})$. Thus, the aligned (and previously z-normalized) subsequence $\tilde{T}_{i,n}$ is defined as $\tilde{T}_{i,n} = (\hat{T}_{i,n}, \underbrace{0, \ldots, 0}_{pad})$ if $n \bmod width_{VPU} > 0$, and $\tilde{T}_{i,n} = \hat{T}_{i,n}$ otherwise. The subsequence matrix $S_T^n \in \mathbb{R}^{N \times (n + pad)}$ is defined row-wise as $S_T^n(i, \cdot) = \tilde{T}_{i,n}$.
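A sketch of building the padded subsequence matrix (NumPy-based; `width_vpu=16` corresponds to 16 floats per 512-bit vector register, an assumption chosen only for illustration):

```python
import numpy as np

def subsequence_matrix(t, n, width_vpu=16):
    """N x (n + pad) matrix of z-normalized subsequences, each row
    zero-padded so that its length is a multiple of width_vpu."""
    t = np.asarray(t, dtype=float)
    N = len(t) - n + 1
    pad = (width_vpu - n % width_vpu) % width_vpu
    S = np.zeros((N, n + pad))
    for i in range(N):
        w = t[i:i + n]
        sigma = w.std()
        S[i, :n] = (w - w.mean()) / (sigma if sigma > 0 else 1.0)
    return S
```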

Internal Data Layout
The parallel algorithm employs the data structures depicted in Fig. 1. When defining the structures that store data in the main memory of a cluster node, we suppose that each structure is shared by all the threads the algorithm is running on, and that each thread processes its own data segment independently. Let us denote the number of threads employed by the algorithm on a cluster node by $p$, and let $iam$ ($0 \le iam \le p - 1$) denote the index of the current thread.
The set of discords $C$ is implemented as an object with two basic attributes, namely the candidate index and the candidate body, which store the indices of all potential discord subsequences and their values, respectively.
Let us denote the ratio of the number of candidates selected at a cluster node to the number of all subsequences of the time series by $\xi$. The exact value of the $\xi$ parameter is a subject of empirical choice; in our experiments, $\xi = 0.01$ was enough to store all candidates. Thus, we denote the number of candidates by $L = \lceil \xi \cdot N \rceil$ and assume that $L \ll N$.
The candidate index is organized as a matrix $C.index \in \mathbb{N}^{p \times L}$, which stores the indices of candidates in the subsequence matrix $S_T^n$: the $i$-th row keeps the indices of the candidates found by the $i$-th thread. Initially, the candidate index is filled with NULL values.
To provide fast access to the candidate index during the selection phase, it is implemented as a deque (double-ended queue) with three attributes, namely count, head, and tail. The deque count is an array $C.count \in \mathbb{N}^p$, which for each thread keeps the number of non-NULL elements in the respective row of the candidate index matrix. The deque head and tail are arrays $C.head, C.tail \in \mathbb{N}^p$, which represent second-level indices: for each thread, they keep the column number in $C.index$ of the most recent NULL value and of the least recent non-NULL value, respectively.
Let $H$ ($H < L \ll N$) be the number of candidates selected at a cluster node during the algorithm's first phase. Then the candidate body is the matrix $C.cand \in \mathbb{R}^{H \times n}$, which represents the candidate subsequences themselves. The candidate body is accompanied by an array $C.pos \in \mathbb{N}^H$, which stores the starting positions of the candidate subsequences in the input time series.
After the selection phase, all the nodes exchange the candidates found in order to construct the combined candidate set, so at each cluster node the candidate body contains the potential discords from all the nodes. In the second phase, the algorithm refines the combined candidate set by comparing the parameter $r$ with the distances between each element of the candidate body and each element of the subsequence matrix.
To parallelize this activity, we process the rows of the subsequence matrix in a segment-wise manner and employ an additional attribute of the candidate body, namely the bitmap. The bitmap is organized as a matrix $C.bitmap \in \mathbb{B}^{p \times H}$, which indicates, for each thread, whether an element of the candidate body has been successfully validated over the thread's segment, i.e., whether its distance to every subsequence of the segment is at least $r$.
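The role of the bitmap can be illustrated with a toy example (hypothetical values for $p = 2$ threads and $H = 3$ candidates): a candidate survives refinement on the node only if every thread's segment confirms it, i.e., its column of the bitmap is all ones.

```python
import numpy as np

# bitmap[t, h] is True when candidate h kept a distance of at least r
# to every subsequence in thread t's segment of the subsequence matrix
bitmap = np.array([[True, True,  False],
                   [True, False, False]])

# column-wise conjunction: candidates confirmed by all p threads
true_discords = bitmap.all(axis=0)
```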

Parallel Implementation of the Algorithm
In the implementation, we apply the following parallelization scheme at the level of the cluster nodes. Let the input time series $T$ be partitioned evenly across $P$ cluster nodes. Each node performs the selection phase on its own partition with the same threshold parameter $r$ and produces a distinct candidate set $C_i$.
Next, as opposed to MR-DADD [4], each node refines its own candidate set $C_i$ with respect to the value of $r$. Indeed, a candidate cannot be a true discord if it is pruned during the refinement phase at even one cluster node. Thus, by this local refinement procedure, we reduce each candidate set $C_i$ and, in turn, the combined candidate set $C_P = \bigcup_{i=1}^{P} C_i$. In the experiments, this allowed us to reduce the size of the combined candidate set severalfold.
Then the combined candidate set $C_P$ is constructed and transmitted to each cluster node. Next, each node refines $C_P$ over its own partition and produces the result $\tilde{C}_i$. Finally, the true discord set is constructed as $\bigcap_{i=1}^{P} \tilde{C}_i$. The parallel implementation of the candidate selection and refinement phases is depicted in Algorithm 2 and Algorithm 3, respectively. To speed up the computations at a cluster node, we omit the square root calculation, since the square root is a monotonically increasing function and omitting it does not change the relative ranking of the candidates.
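The node-level scheme (select, local refinement, union, global refinement, intersection) can be illustrated on a toy case where "subsequences" are single values and the distance is the absolute difference; this is our own miniature model of the scheme, not the actual MPI implementation:

```python
def distributed_range_discords(parts, r):
    """Each node selects local range discords, refines them locally, the
    union of the local sets is 'broadcast' to every node, each node refines
    it again over its own partition, and the intersection of the per-node
    results gives the global discords as (position, value) pairs."""
    offs, o = [], 0
    for p in parts:
        offs.append(o)
        o += len(p)

    def select(pi):
        p, off = parts[pi], offs[pi]
        return {(off + i, v) for i, v in enumerate(p)
                if all(abs(v - p[j]) >= r for j in range(len(p)) if j != i)}

    def refine(cands, pi):
        p, off = parts[pi], offs[pi]
        return {(pos, v) for (pos, v) in cands
                if all(abs(v - u) >= r
                       for j, u in enumerate(p) if off + j != pos)}

    local_sets = [refine(select(i), i) for i in range(len(parts))]
    combined = set().union(*local_sets)
    return set.intersection(*(refine(combined, i) for i in range(len(parts))))
```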
In the selection phase, we parallelize the outer loop along the rows of the subsequence matrix, while in the inner loop along the candidates, each thread processes its own segment of the candidate index. By the end of the phase, the candidates found by each thread are placed into the candidate body, and all the cluster nodes exchange their candidates to construct the combined candidate set. In the refinement phase, we also parallelize the outer loop along the rows of the subsequence matrix, and in the inner loop along the candidates, each thread processes its own segments of the candidate body and bitmap. In this implementation, we do not use the early abandoning technique for the distance calculation, relying on the fact that vectorization of the squared Euclidean distance may give more benefit. By the end of the phase, the column-wise conjunction of the elements of the bitmap matrix yields the set of true discords found by the current cluster node. The intersection of such sets is performed by one of the cluster nodes, to which the rest of the nodes send their resulting sets.

Experiments
We evaluated the proposed algorithm in experiments conducted on the Tornado SUSU computer cluster [23] with nodes based on Intel MIC accelerators [24]. Each cluster node is equipped with an Intel Xeon Phi SE10X accelerator with a peak performance of 1.076 TFLOPS (60 cores at 1.1 GHz with a hyper-threading factor of 4×). In the experiments, we investigated the scalability of our approach and compared it with analogs; the results are given below in Sections 5.1 and 5.2, respectively.

The Algorithm's Scalability
In the first series of experiments, we assessed the algorithm's scaled speedup, which is defined as the speedup obtained when the problem size is increased linearly with the number of nodes added to the computer cluster [25]. Applied to our problem, the algorithm's scaled speedup is calculated as $s(P) = \frac{t_1(P \cdot m)}{t_P(P \cdot m)}$, where $P$ is the number of cluster nodes, $m$ is a factor of the time series length, and $t_P(P \cdot m)$ is the algorithm's run time when a time series of length $P \cdot m$ is processed on $P$ nodes; the computational load is determined by the discord length $n$ and by $C_{(P \cdot m)}$, the set of all candidate discords selected by the algorithm at its first phase from a time series of length $P \cdot m$. For the evaluation, we took the ECG time series [26] (see Tab. 1 for a summary of the data involved). In the experiments, we discovered discords on up to 128 cluster nodes with the time series factor $m = 10^6$ and varied the discord length $n$, while the range parameter $r$ was chosen empirically to provide the algorithm's best performance.
The results of the experiments are depicted in Fig. 2. As can be seen, our algorithm adapts well to increases in both the time series length and the number of cluster nodes, and demonstrates linear scaled speedup. As expected, the algorithm shows better scalability for larger values of the discord length, because they provide a higher computational load.

Comparison with Analogs
In the second series of experiments, we compared the performance of our algorithm against the analogs already considered in Section 3, namely DDD [15], MR-DADD [4], GPU-STAMP [19], and MP-HPC [21]. We omit the PDD algorithm [18], since in our previous experiments [5] it performed poorly because of its intensive message passing. Throughout the experiments, we used synthetic time series generated according to the Random Walk model [27], as those were employed for evaluation by the competitors. For comparison purposes, we used the run times reported by the authors of the respective algorithms. To perform the comparison, we ran our algorithm on Tornado SUSU with a reduced number of nodes and cores per node, to make the peak performance of our hardware platform approximately equal to that of the system on which the corresponding competitor was evaluated.
Tab. 2 summarizes the performance of the proposed algorithm compared with the analogs. We can see that our algorithm outperforms its competitors. As expected, the direct analogs DDD and MR-DADD are inferior to our algorithm, since they do not employ parallelism within a single cluster node. Additionally, the indirect analogs GPU-STAMP and MP-HPC are behind our algorithm, since they initially aim to solve the computationally more complex problem of computing the matrix profile, which can be used for discords discovery among many other time series mining problems.

Conclusion
In this article, we addressed the problem of discovering anomalous subsequences in a very long time series. Currently, there is a wide spectrum of real-world applications where it is typical to deal with multi-terabyte time series that cannot fit in main memory: medicine, astronomy, economics, climate modeling, predictive maintenance, energy consumption, and others. In this study, we employ the discord concept: a discord is a subsequence of the time series that has the largest distance to its nearest non-self match neighbor subsequence.
We proposed a novel parallel algorithm for discords discovery in very long time series on a modern high-performance cluster with nodes based on many-core accelerators. Our algorithm utilizes the serial disk-aware algorithm by Yankov, Keogh et al. [4] as a basis. We achieve parallelization among the cluster nodes as well as within a single node. At the level of the cluster nodes, we modified the original parallelization scheme, which allowed us to reduce the number of candidate discords to be processed. Within a single cluster node, we proposed a set of matrix data structures to store and index the subsequences of a time series and to provide efficient vectorization of computations on the many-core accelerator.
The experimental evaluation on a real computer cluster with real and synthetic time series showed the linear scalability of the proposed algorithm, which improves further under a high computational load due to a greater discord length. The algorithm's performance was also ahead of the analogs that do not employ both a computer cluster and many-core accelerators.
In further studies, we plan to elaborate versions of the algorithm for computer clusters with GPU nodes.
Funding Statement: This work was financially supported by the Russian Foundation for Basic Research (Grant No. 20-07-00140) and by the Ministry of Science and Higher Education of the Russian Federation (Government Order FENU-2020-0022).

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.