Detecting and monitoring abrupt emergences and submergences of episodes over data streams
Introduction
Episodes, introduced by Mannila [1], are important patterns for modelling the relative order of occurrences of different types of data elements over a single data sequence. For instance, the order ‘A occurs before C’ can be represented as a serial episode denoted as A → C. Episodes can be used in a variety of sequences and streams in real applications. For example, over a DNA sequence, an episode represents the relative order of positions of different types of nucleotides; over a stream of HTTP requests received by a Web server, an episode represents the order of access to different Web resources; for event streams gathered in sensor networks, an episode represents the relative order of occurrences of different types of events, such as smoke appearances and temperature increases.
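As a minimal sketch of the idea, the Python function below tests whether a serial episode occurs in a sequence, i.e. whether its event types appear in the given relative order, not necessarily adjacently. The function name `occurs_serial` and the list-based encoding of episodes and sequences are assumptions made for illustration; they are not notation from this paper.

```python
from typing import Sequence


def occurs_serial(episode: Sequence[str], sequence: Sequence[str]) -> bool:
    """Return True if the event types of `episode` appear in `sequence`
    in the same relative order (gaps between them are allowed)."""
    it = iter(sequence)
    # `any(...)` consumes the shared iterator up to the first match, so each
    # later episode element is only searched for after the previous one.
    return all(any(event == wanted for event in it) for wanted in episode)


# 'A occurs before C' holds in the first toy sequence but not in the second.
print(occurs_serial(["A", "C"], list("ABDC")))  # True
print(occurs_serial(["A", "C"], list("CBDA")))  # False
```

The check only establishes that an occurrence exists; counting how often an episode occurs requires a frequency measure, which the paper addresses separately.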
Most existing studies on episodes [1], [2], [5], [6], [7], [8], [10] concentrate on frequent episode (FE) discovery. Such studies calculate the frequencies of episodes over a whole sequence and extract the episodes that occur frequently over it. FEs, however, are not suited to data streams for two reasons: (1) it is impractical to consider frequencies over a whole stream because the stream is normally unbounded, and (2) the frequencies of episodes over a data stream may increase or decrease during the lifetime of the stream. For example, given the sample stream in Fig. 1, we consider changes in the frequencies of episodes over the stream. Intuitively, one episode recurs frequently during time interval [1, 5], while in time interval [6, 10] it never appears (its frequency decreases sharply) and another episode appears frequently. It is clear that FEs over the whole stream cannot capture these dynamic changes.
These changes missed by the FEs may be of crucial significance and interest to stream monitoring and analysis in some real applications. We consider the scenarios in two typical real applications. First, in monitoring a stream of online HTTP requests received by a Web server, a significant increase in the frequencies of particular episodes may indicate the increase of similar users. The appearances of new episodes may also be a sign of access for a new group of users or abnormal access behaviour such as suspicious intrusions. In this scenario, detection and monitoring of such changes is critical for Web managers to observe any change in the population of user groups and to detect suspicious intrusions. A similar situation exists over streams monitored by a sensor network. Changes detected over streams gathered in a sensor network may indicate that particular events are likely to happen (or have just happened). For instance, in a wireless sensor network for detecting forest fires, an emergence of smoke followed by a sharp increase in temperature may be a sign of a fire. Therefore, it is important to detect the changes in the frequency of episodes over data streams.
In this paper, we focus on detecting and tracing the changes in the frequencies of episodes over data streams. Episodes of significant frequency increases are called ‘emerging episodes’ (EEs), and episodes of significant frequency decreases are called ‘submerging episodes’ (SEs). In order to detect the significant changes as early as possible, we only identify abrupt EEs (aEEs) and abrupt SEs (aSEs), i.e., the latest first emergence and submergence of the episodes.
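The notion of emerging and submerging episodes can be illustrated with a small Python sketch that compares an episode's frequency in an older window with its frequency in the most recent window. Everything specific here is an assumption for illustration: the greedy non-overlapping count is a simple stand-in and is not T-freq, the `ratio` threshold and the function names are hypothetical, and the two windows are toy data rather than the stream of Fig. 1.

```python
from typing import Sequence


def count_occurrences(episode: Sequence[str], window: Sequence[str]) -> int:
    """Count non-overlapping occurrences of a serial episode in `window`
    with a greedy left-to-right scan (a simple stand-in, NOT T-freq)."""
    if not episode:
        return 0
    count, pos = 0, 0
    for event in window:
        if event == episode[pos]:
            pos += 1
            if pos == len(episode):  # one full occurrence completed
                count += 1
                pos = 0
    return count


def classify_change(episode, reference, current, ratio=2.0):
    """Label the change in frequency from `reference` to `current`.
    `ratio` is a hypothetical threshold for a 'significant' change."""
    f_ref = count_occurrences(episode, reference)
    f_cur = count_occurrences(episode, current)
    if f_cur > 0 and f_cur >= ratio * f_ref:
        return "emerging"
    if f_ref > 0 and f_ref >= ratio * f_cur:
        return "submerging"
    return "stable"


reference = list("ABABABCDAB")  # older window (toy data)
current = list("CDCDCDABCD")    # most recent window (toy data)
print(classify_change(["A", "B"], reference, current))  # submerging
print(classify_change(["C", "D"], reference, current))  # emerging
```

A whole-stream frequency count would report both episodes as similarly frequent in this toy example, which is why a window-to-window comparison is needed to expose the change.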
The discovery of aEEs and aSEs faces a number of challenging requirements, such as a single pass over the stream, real-time update and return of results, limited consumption of time and space, and energy saving. This paper aims to propose an aEE–aSE mining method that satisfies these requirements. We choose T-freq [2] to measure the frequencies of episodes, and propose an efficient mining method that utilises novel properties of T-freq. Our main contributions are summarised as follows.
1. We extend existing studies on frequent episode mining by defining and investigating a new problem, aEE–aSE discovery, which detects and traces dynamic changes in frequencies of episodes over time-evolving streams.
2. An efficient one-pass method is proposed for the discovery of aEEs and aSEs.
3. Through extensive experiments, we evaluate the effectiveness and efficiency of the proposed method, demonstrate that the discovered patterns are meaningful and natural, and show its distinct advantages over existing frequent episode mining approaches.
The rest of this paper is organised as follows. Section 2 presents preliminaries, the frequency measure and the problem definition. The mining method is proposed in Section 3. Experimental results are addressed in Section 4. Section 5 reviews related work and the conclusion of this paper is presented in Section 6.
Section snippets
Preliminaries, frequency measure and problem definition
Section 2.1 presents preliminaries. Section 2.2 describes the frequency measure adopted in this paper, T-freq [2]. The mining problem is defined in Section 2.3.
The proposed mining method
According to Definitions 9, 10 and 11, in order to obtain GR and DR and to discover aEEs and aSEs, the sup (T-freq) of episodes in Sc and Sr should be computed first. Since the length of both Sc and Sr is specified as N, sup is between 1 and N. Let (). Let MFi be the set of maximal FEs w.r.t. . Then the multiple layered maximal frequent episodes, , is essentially a set of closed episodes. Therefore, sup of any episode outside MF is equal to the
Empirical evaluation
The proposed method was evaluated on both synthetic data and real data. Experiments on synthetic data focus on evaluating the effectiveness and efficiency of our method, comparing our method with previous approaches, and identifying their differences and the advantages of the proposed method. The main purpose of the experiments on real data is to verify that the patterns discovered by our method are natural and meaningful, i.e., they reflect real facts and regularities underlying in the real
Related work
This section reviews five major areas of work related to the problem considered in this paper: frequent episode mining, episode rule discovery, emerging pattern mining, change detection in data streams, and sequential pattern mining. The main purpose is to explore the differences and relationships between this related work and our work in this paper.
Conclusions and future work
Existing studies on frequent episode mining [1], [2], [5], [6], [7], [8], [10] consider the frequency of episodes in a whole sequence from a static view, and thus the discovered frequent episodes are unable to reflect dynamic changes in time-evolving streams. In order to detect the dynamic changes in the frequency of episodes in time-evolving streams, in this paper we defined and investigated a new problem: online mining of abrupt emerging episodes and abrupt submerging episodes over data streams,
References (28)
- et al., Efficient mining of frequent episodes from complex sequences, Information Systems (2008)
- H. Mannila, H. Toivonen, A. Verkamo, Discovering frequent episodes in sequences, in: Proceedings of the 1st ACM SIGKDD...
- K. Iwanuma, R. Ishihara, Y. Takano, H. Nabeshima, Extracting frequent subsequences from a single long data sequence: a...
- M. Gan, H. Dai, A study on the accuracy of frequency measures and its impact on knowledge discovery in single...
- C.G. Gemma, Discovering unbounded episodes in sequential data, in: Proceedings of the 7th European Conference on...
- S. Laxman, P. Sastry, K. Unnikrishnan, A fast algorithm for finding frequent episodes in event streams, in: Proceedings...
- H. Mannila, H. Toivonen, Discovering generalized episodes using minimal occurrences, in: Proceedings of the 2nd ACM...
- et al., Discovery of frequent episodes in event sequences, Data Mining and Knowledge Discovery (1997)
- N. Meger, C. Rigotti, Constraint-based mining of episode rules and optimal window sizes, in: Proceedings of European...
- W. Zhou, H. Liu, H. Cheng, Mining closed episodes from event sequences efficiently, in: Proceedings of the 14th...
- Honeypots: Tracking Hackers