Elsevier

Information Systems

Volume 39, January 2014, Pages 277-289
Detecting and monitoring abrupt emergences and submergences of episodes over data streams

https://doi.org/10.1016/j.is.2012.05.009

Abstract

Existing studies on episode mining mainly concentrate on the discovery of (global) frequent episodes in sequences. However, frequent episodes are not suited for data streams because they do not capture the dynamic nature of the streams. This paper focuses on detecting dynamic changes in frequencies of episodes over time-evolving streams. We propose an efficient method for the online detection of abrupt emerging episodes and abrupt submerging episodes over streams. Experimental results on synthetic data show that the proposed method can effectively detect the defined patterns and meet the strict requirements of stream processing, such as one-pass, real-time update and return of results, plus limited time and space consumption. Experimental results on real data demonstrate that the patterns detected by our method are natural and meaningful. The proposed method has wide applications in stream monitoring and analysis as the discovered patterns indicate dynamic emergences/disappearances of noteworthy events/phenomena hidden in the streams.

Introduction

Episodes, introduced by Mannila [1], are important patterns for modelling the relative order of occurrences of different types of data elements over a single data sequence. For instance, the order ‘A occurs before C’ can be represented as a serial episode denoted as AC. Episodes can be used in a variety of sequences and streams in real applications. For example, over a DNA sequence, an episode represents the relative order of positions of different types of nucleotides; over a stream of HTTP requests received by a Web server, an episode represents the order of access to different Web resources; for event streams gathered in sensor networks, an episode represents the relative order of occurrences of different types of events, such as smoke appearances and temperature increases.

Most existing studies on episodes [1], [2], [5], [6], [7], [8], [10] concentrate on frequent episode (FE) discovery. Such studies calculate the frequencies of episodes over a whole sequence, and extract the episodes that occur frequently over the whole sequence. FEs, however, are not suited to data streams for two reasons: (1) it is impractical to consider the frequencies over a whole stream because the stream is normally unbounded and (2) the frequencies of episodes over a data stream may change (increase or decrease) over time during the whole lifetime of the stream. For example, given the sample stream in Fig. 1, we consider changes in frequencies of episodes over the stream. Intuitively, episode AC frequently recurs during time interval [1, 5], while in time interval [6, 10] AC never appears (the frequency decreases sharply) and XY appears frequently. It is clear that FEs over the whole stream cannot capture these dynamic changes.
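The contrast between the two intervals can be illustrated with a small sketch. The stream below is hypothetical (it is not the stream of Fig. 1), and the counting rule, a greedy count of non-overlapping occurrences of a serial episode, is an assumption chosen only for illustration; it is not the T-freq measure adopted in the paper.

```python
def count_serial(events, episode):
    """Greedily count non-overlapping occurrences of a serial episode.

    `events` is a list of event types in time order; `episode` is a string
    such as "AC", meaning 'A occurs before C'. Illustrative measure only.
    """
    count = 0
    pos = 0  # index of the next episode element we are waiting for
    for e in events:
        if e == episode[pos]:
            pos += 1
            if pos == len(episode):  # one full occurrence matched
                count += 1
                pos = 0
    return count


# Hypothetical stream: one event per time unit, times 1..10.
stream = list("ACACBXYXYB")
first, second = stream[:5], stream[5:]  # intervals [1, 5] and [6, 10]

whole_ac = count_serial(stream, "AC")
whole_xy = count_serial(stream, "XY")
per_interval = {
    "AC": (count_serial(first, "AC"), count_serial(second, "AC")),
    "XY": (count_serial(first, "XY"), count_serial(second, "XY")),
}
```

Over the whole hypothetical stream, AC and XY receive the same count, so a global frequency threshold treats them alike. Only the per-interval counts reveal that AC disappears in the second interval while XY appears.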

These changes, which are missed by FEs, may be of crucial significance and interest to stream monitoring and analysis in some real applications. We consider the scenarios in two typical real applications. First, in monitoring a stream of online HTTP requests received by a Web server, a significant increase in the frequencies of particular episodes may indicate an increase in the number of similar users. The appearance of new episodes may also be a sign of access by a new group of users or of abnormal access behaviour such as suspicious intrusions. In this scenario, detection and monitoring of such changes is critical for Web managers to observe any change in the population of user groups and to detect suspicious intrusions. A similar situation exists over streams monitored by a sensor network. Changes detected over streams gathered in a sensor network may indicate that particular events are likely to happen (or have just happened). For instance, in a wireless sensor network for detecting forest fires, an emergence of smoke followed by a sharp increase in temperature may be a sign of a fire. Therefore, it is important to detect the changes in the frequency of episodes over data streams.

In this paper, we focus on detecting and tracing the changes in the frequencies of episodes over data streams. Episodes of significant frequency increases are called ‘emerging episodes’ (EEs), and episodes of significant frequency decreases are called ‘submerging episodes’ (SEs). In order to detect the significant changes as early as possible, we only identify abrupt EEs (aEEs) and abrupt SEs (aSEs), i.e., the latest first emergence and submergence of the episodes.
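The distinction between emerging and submerging episodes can be sketched as a comparison of an episode's support in a reference window and in the current window. The paper's own growth and decline measures are not given in this excerpt, so the ratio test, the default threshold and the function name below are illustrative assumptions rather than the authors' definitions.

```python
def classify_change(sup_ref, sup_cur, ratio=2.0):
    """Label an episode from its support in two adjacent windows.

    sup_ref: support in the reference (older) window.
    sup_cur: support in the current (newer) window.
    ratio:   hypothetical threshold for a 'significant' change.
    """
    # Emerging: the episode is present now and at least `ratio` times
    # as frequent as before (this includes appearing from nothing).
    if sup_cur > 0 and sup_cur >= ratio * sup_ref:
        return "emerging"
    # Submerging: the episode was present and is now at least `ratio`
    # times less frequent (this includes disappearing entirely).
    if sup_ref > 0 and sup_ref >= ratio * sup_cur:
        return "submerging"
    return "stable"


labels = {
    "XY": classify_change(sup_ref=0, sup_cur=2),
    "AC": classify_change(sup_ref=2, sup_cur=0),
    "AB": classify_change(sup_ref=2, sup_cur=3),
}
```

Under these assumptions, an episode that appears from nothing is labelled emerging, one that vanishes is labelled submerging, and a small fluctuation is left as stable.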

The discovery of aEEs and aSEs faces a number of challenging requirements, such as one-pass of the stream, real-time update and return of results, limited consumption of time and space, and energy saving. This paper aims to propose an aEE–aSE mining method that satisfies the challenging requirements. We choose T-freq [2] to measure frequencies of episodes, and propose an efficient mining method by utilising the novel properties of T-freq. Our main contributions are summarised as follows.

  1. We extend existing studies on frequent episode mining by defining and investigating a new problem, aEE–aSE discovery, which detects and traces dynamic changes in frequencies of episodes over time-evolving streams.

  2. An efficient one-pass method is proposed for the discovery of aEEs and aSEs.

  3. Through extensive experiments, we evaluate the effectiveness and efficiency of the proposed method, demonstrate the discovered patterns are meaningful and natural, and show its distinct advantages against existing frequent episode mining approaches.

The rest of this paper is organised as follows. Section 2 presents preliminaries, the frequency measure and the problem definition. The mining method is proposed in Section 3. Experimental results are addressed in Section 4. Section 5 reviews related work and the conclusion of this paper is presented in Section 6.

Section snippets

Preliminaries, frequency measure and problem definition

Section 2.1 presents preliminaries. Section 2.2 addresses frequency measure, T-freq [2], adopted in this paper. The mining problem is defined in Section 2.3.

The proposed mining method

According to Definitions 9, 10 and 11, in order to obtain GR, DR, and discover aEEs and aSEs, sup (T-freq) of episodes in $S_c$ and $S_r$ should be computed first. Since the length of $S_c$ and $S_r$ is specified as $N$, sup is between 1 and $N$. Let $min\_sup_i = i$ ($i = 1, 2, \ldots, N$). Let $MF_i$ be the set of maximal FEs w.r.t. $min\_sup_i$. Then the multiple layered maximal frequent episodes, $MF = \bigcup_{i=1}^{N} MF_i$, is essentially a set of closed episodes. Therefore, sup of any episode $\alpha$ outside $MF$ is equal to the

Empirical evaluation

The proposed method was evaluated on both synthetic data and real data. Experiments on synthetic data focus on evaluating the effectiveness and efficiency of our method, comparing our method with previous approaches, and identifying their differences and the advantages of the proposed method. The main purpose of the experiments on real data is to verify that the patterns discovered by our method are natural and meaningful, i.e., they reflect real facts and regularities underlying in the real

Related work

This section reviews five major works related to the problem considered in this paper: frequent episode mining, episode rule discovery, emerging pattern mining, change detection in data streams, and sequential pattern mining. The main purpose is to explore the differences and relationships between the related works and our work in this paper.

Conclusions and future work

Existing studies on frequent episode mining [1], [2], [5], [6], [7], [8], [10] consider the frequency of episodes in a whole sequence in a static view, and thus discovered frequent episodes are unable to reflect dynamic changes in time-evolving streams. In order to detect the dynamic changes in the frequency of episodes in time-evolving streams, in this paper, we defined and investigated a new problem: online mining of abrupt emerging episodes and abrupt submerging episodes over data streams,

References (28)

  • K. Huang et al., Efficient mining of frequent episodes from complex sequences, Information Systems (2008)
  • H. Mannila, H. Toivonen, A. Verkamo, Discovering frequent episodes in sequences, in: Proceedings of the 1st ACM SIGKDD...
  • K. Iwanuma, R. Ishihara, Y. Takano, H. Nabeshima, Extracting frequent subsequences from a single long data sequence: a...
  • M. Gan, H. Dai, A study on the accuracy of frequency measures and its impact on knowledge discovery in single...
  • C.G. Gemma, Discovering unbounded episodes in sequential data, in: Proceedings of the 7th European Conference on...
  • S. Laxman, P. Sastry, K. Unnikrishnan, A fast algorithm for finding frequent episodes in event streams, in: Proceedings...
  • H. Mannila, H. Toivonen, Discovering generalized episodes using minimal occurrences, in: Proceedings of the 2nd ACM...
  • H. Mannila et al., Discovery of frequent episodes in event sequences, Data Mining and Knowledge Discovery (1997)
  • N. Meger, C. Rigotti, Constraint-based mining of episode rules and optimal window sizes, in: Proceedings of European...
  • W. Zhou, H. Liu, H. Cheng, Mining closed episodes from event sequences efficiently, in: Proceedings of the 14th...
  • L. Spitzner, Honeypots: Tracking Hackers (2003)
  • List of Honeypots Solutions 〈http://www.tracking-hackers.com/solutions/〉 (last accessed...
  • KFSensor 〈http://www.keyfocus.net/index.php〉 (last accessed...
  • Trial Version of KFSensor 〈http://www.keyfocus.net/kfsensor/download/〉 (last accessed...