Efficient Approach to Discover Interval-Based Sequential Patterns

Most sequential pattern mining methods concentrate only on time-point-based event data, although some research efforts have addressed mining patterns from time-interval-based event data. In many applications, events occur over a time interval rather than at a single time point; for example, a patient is affected by a disease for a certain period. Our goal is to mine the frequently occurring sequential patterns in the database. In this study we introduce a new algorithm, KPrefixspan, which modifies the TPrefixspan algorithm to overcome its demerits. A new approach called the refined database reduces the scanning time considerably, since unsupported events are removed at each projection, and the resulting sequential patterns are also more precise. Experiments were conducted on synthetic datasets. In the experimental results, the running time was reduced by almost 60% and the memory usage by almost 25% compared with the existing TPrefixspan algorithm.


INTRODUCTION
Generally, data mining tasks can be classified into two main types: descriptive mining and predictive mining. Descriptive data mining depicts a dataset in a brief, summarized manner and discloses the significant properties of the data. Generalization is the basis of descriptive data mining approaches, which can be used to shorten the data by applying attribute-oriented induction with the aid of characteristic rules and generalized relations (Han and Kamber, 2001). Descriptive mining techniques include Clustering (Liu and Yu, 2005), Association Rule Mining and Sequential Pattern Mining. Predictive mining, on the other hand, is the process of deriving patterns from data to make predictions. Classification, Regression and Deviation detection are some of the most important processes in predictive mining. In short, descriptive data mining aims to summarize the data and highlight its interesting properties, while predictive data mining aims to build models that forecast future behavior (Han and Kamber, 2001).
Sequential pattern mining is one of the important topics of data mining and is an extension of association rule mining (Masseglia et al., 2003). A sequential pattern mining algorithm deals with the problem of finding the frequent sequences in a given database. Sequential pattern mining is closely related to association rule mining, except that the events are associated by time (Sobh, 2007). Sequential patterns capture associations among transactions, while association rules describe intra-transaction relationships. In association rule mining, the mined output describes the items that are frequently bought together in a single transaction, whereas the output of sequential pattern mining describes the items that are bought in a particular order by the same customer across different transactions (Zhao and Bhowmick, 2003).

JCS
Most database-related applications are temporal in nature, for example financial applications such as portfolio management, accounting and banking; most of these applications depend on temporal databases that record time-referenced data (Jensen, 2000). Although much successful research has been carried out in the field of 'static' data mining, there is still considerable scope for research on its extension to temporal data mining, in which the temporal dimension is represented and reasoned about explicitly (Moskovitch and Shahar, 2005).
Time series prediction, sequence classification, sequence clustering, search and retrieval of sequences, and pattern discovery are the five most important tasks in temporal data mining (Laxman and Sastry, 2006). Among them, pattern discovery has drawn a great deal of attention owing to its substantial use in stock trend prediction and in applications that use the history of symptoms to diagnose certain kinds of diseases. There are two prominent frameworks for frequent pattern discovery: sequential patterns and episodes (Laxman and Sastry, 2006). Temporal data mining involves the mining of large sequential datasets, in which the data are ordered with respect to some index (Antunes and Oliveira, 2001).
Mining of sequential patterns in time series data is carried out in various fields in order to make predictions. An appropriate model has to be proposed before a prediction can be made, so the way in which time series patterns are discovered from a time series database becomes extremely significant (Zhu et al., 2009). A sequence of events corresponds to a sequence of instants at which these events happen. However, there are various situations where events have a certain duration, so the underlying time is expressed in terms of intervals instead of points. Our work is motivated by several prior studies on mining temporal sequences from time interval data (Guyet and Quiniou, 2011; Chen et al., 2010; Wu and Chen, 2007; Patel et al., 2008).
In this study, we propose an algorithm called KPrefixspan to extract the frequently occurring sequential patterns. Our ultimate goal is to mine time-interval-based sequential patterns efficiently, so the frequently occurring sequences are computed using the KPrefixspan algorithm. The proposed approach comprises three major steps: (i) creating the refined database, (ii) constructing patterns based on time intervals and (iii) mining sequential patterns based on the projection database. In the projection stage, sequences of different lengths are selected from each projection, such as one-length patterns and two-length patterns. In each projection, the unsupported events are removed, which reduces the scanning time while still yielding accurate results.

Database
The database DB consists of a set of patients $P = \{p_1, \ldots, p_k\}$, $1 \le i \le k$, where $k$ is the total number of patients. Each patient $p_i$ has a list of diseases $D = \{d_1, \ldots, d_l\}$, $1 \le j \le l$, where $l$ is the total number of diseases for that patient. Each disease $d_j$ has a time interval with starting time $t_s$ and ending time $t_e$, where the starting time is always less than the ending time, $t_s < t_e$.
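The definition above can be written down as a small data model. The following is only a sketch under assumed names: the class `DiseaseInterval`, the dictionary layout and the sample values are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiseaseInterval:
    """One disease d_j of a patient with its time interval (hypothetical model)."""
    disease: str  # disease label d_j
    start: int    # starting time t_s
    end: int      # ending time t_e

    def __post_init__(self):
        # The text requires t_s < t_e for every disease.
        if not self.start < self.end:
            raise ValueError(f"t_s must be less than t_e for {self.disease!r}")


# Toy database DB (invented values): patient ID -> list of disease intervals.
db = {
    101: [DiseaseInterval("b", 1, 3), DiseaseInterval("a", 4, 8)],
    102: [DiseaseInterval("a", 2, 5)],
}
```

Rejecting intervals with $t_s \ge t_e$ at construction time keeps the later sorting and projection steps free of special cases.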

Generation of Refined Database (RDB)
The refined database is constructed from the original database DB and contains fewer diseases than DB. We construct the RDB by removing the diseases whose patient count is less than a threshold value $T_h$. The threshold value must be less than the number of patients $k$. For each disease $d_j$, count the number of patients in DB affected by it, $N(d_j)$. A disease is removed from DB when $N(d_j) < T_h$. The remaining diseases are placed in the Refined Database (RDB).
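The refinement step can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the function name `refine_database`, the tuple layout `(disease, start, end)` and the toy data are assumptions.

```python
from collections import Counter


def refine_database(db, threshold):
    """Drop every disease d_j with N(d_j) < T_h (hypothetical helper).

    db maps patient ID -> list of (disease, start, end) tuples.
    """
    support = Counter()
    for intervals in db.values():
        # Count each disease once per patient, so N(d_j) is a patient count.
        support.update({disease for disease, _, _ in intervals})
    return {
        pid: [iv for iv in intervals if support[iv[0]] >= threshold]
        for pid, intervals in db.items()
    }


# Toy data (invented): a occurs in 3 patients, b in 2, c and d in 1 each.
db = {
    1: [("a", 1, 4), ("b", 2, 3), ("c", 5, 6)],
    2: [("a", 1, 2), ("b", 3, 5)],
    3: [("a", 2, 6), ("d", 1, 3)],
}
rdb = refine_database(db, threshold=2)
```

With a threshold of 2, diseases c and d are dropped, and patient 3 keeps only disease a.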

Building the Sequences of Diseases $P_i[S(d_j)]$
From the refined database we build the disease sequences: for each patient, the diseases are sorted in ascending order of their time intervals. These disease sequences $P_i[S(d_j)]$ are the input to the KPrefixspan algorithm.
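One way to build such a sequence is to split every interval into a start event and an end event and sort the events by time. The sketch below assumes this endpoint representation, with labels such as a_s1 and a_e1 for the first occurrence of disease a. The function name, the tie-breaking rule (an end is placed before a start at the same instant) and the interval times are assumptions made for illustration.

```python
from collections import defaultdict


def endpoint_sequence(intervals):
    """Turn (disease, start, end) tuples into a time-ordered list of endpoints."""
    occurrence = defaultdict(int)
    events = []
    # Number repeated occurrences of a disease in order of starting time.
    for disease, start, end in sorted(intervals, key=lambda iv: (iv[1], iv[2])):
        occurrence[disease] += 1
        k = occurrence[disease]
        events.append((start, 1, f"{disease}_s{k}"))  # 1 = start event
        events.append((end, 0, f"{disease}_e{k}"))    # 0 = end event
    # Sort by time; at equal times an end event comes first (assumed rule).
    events.sort()
    return [label for _, _, label in events]


# Invented interval times for a single patient.
patient = [("b", 1, 2), ("a", 3, 6), ("d", 4, 8), ("e", 9, 12), ("a", 10, 14)]
print(endpoint_sequence(patient))
# ['b_s1', 'b_e1', 'a_s1', 'd_s1', 'a_e1', 'd_e1', 'e_s1', 'a_s2', 'e_e1', 'a_e2']
```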

Proposed Mining Technique to Mine the Sequential Disease of the Patient from the Hospital Database
In the proposed approach we have developed an efficient algorithm to mine the sequential diseases of patients by overcoming the challenges of the TPrefixspan algorithm used by Wu and Chen (2007). The major difficulties of the TPrefixspan algorithm when mining sequential diseases from the database are: (1) the running time is high and (2) it needs a large memory space. Bearing these challenges in mind, we propose an efficient mining technique, the KPrefixspan algorithm, for mining sequential diseases based on the time interval of each disease. In the proposed algorithm, the steps involved in mining of

Refined Database
The refined database is generated from the original database based on the threshold value. The count of each disease in the original database is calculated, and the diseases whose count is below the threshold value are then removed from the original database. Table 2 shows the refined database obtained from the original database, assuming a threshold value of 3.

Construction of Disease Sequences with Starting Time and Ending Time for Each Patient
Each disease has a starting time and an ending time, and when one disease occurs another disease often occurs alongside it. Using this, we can infer the next likely disease of a patient and take preventive action before the patient is affected. The pictorial representation of the disease sequence over time is given in Fig. 1.
In the figure, patient 101 is first affected by disease b. After b finishes, disease d starts; some time later disease a starts, and disease d finishes before disease a finishes. Thereafter disease e starts; before e finishes, a starts again, and e finishes after a has started. The disease sequences of each patient are listed in Table 3, which consists of each patient ID and the corresponding sequence of diseases.
In Table 3, patient ID 101 has diseases a, b, d and e, and the sequence of diseases is generated based on the time intervals. For patient ID 101, disease b starts and then ends; then disease d starts before disease a starts, and disease a ends before disease d ends. Then disease e starts; before e ends, disease a starts again and later ends. Using these starting and ending points of the diseases, we build the disease sequences of the other patients based on the time intervals.

Hereafter, we use the projection method to mine the sequences of diseases for the whole database.

Mining the Sequences Based on the KPrefixspan Algorithm
Here, we present KPrefixspan, our modification of the TPrefixspan algorithm, for mining the sequential patterns. The major advantages of this algorithm are that it reduces the scanning time of the projected database and that the resulting sequential patterns are accurate. In this study we use the refined database for mining the sequential patterns, since it contains less data. In this algorithm we first find the one-length patterns from the refined database. Table 4 shows the one-length patterns of the KPrefixspan algorithm. From the refined database we find the one-length patterns $a_{s1}$, $a_{e1}$, $b_{s1}$ and $b_{e1}$.
In order to find the sequential patterns we use the refined database; here we take the disease event $a_{s1}$ as the prefix for the sequential pattern.
For the prefix $a_{s1}$ the possible two-length sequences are $a_{s1} \to a_{e1}$, $a_{s1} \to d_{s1}$, $a_{s1} \to d_{e1}$, $a_{s1} \to e_{e1}$, $a_{s1} \to e_{s1}$ and $a_{s1} \to b_{e1}$. For example, $a_{s1} \to a_{e1}$ is a two-length sequence, and scanning starts after $a_{e1}$ in the first projected database. The scanning is reduced substantially here, since a large amount of data has already been removed from the first projected database for not meeting the threshold value. Table 6 shows the projected database and sequential diseases for the three-length patterns. The three-length patterns available from Table 5 are $a_{s1} \to a_{e1} \to d_{e1}$, $a_{s1} \to a_{e1} \to e_{e1}$, $a_{s1} \to d_{s1} \to d_{e1}$ and $a_{s1} \to d_{s1} \to a_{e1}$. The possible longer patterns are then found from these three-length patterns.
From Table 6 we can find the four-length patterns. They are sought for the three-length patterns $a_{s1} \to a_{e1} \to d_{e1}$, $a_{s1} \to a_{e1} \to e_{e1}$, $a_{s1} \to d_{s1} \to d_{e1}$ and $a_{s1} \to d_{s1} \to a_{e1}$. Of these, a four-length pattern can be derived from only one three-length pattern, $a_{s1} \to d_{s1} \to a_{e1}$, since only this pattern has an extension that meets the threshold value. The four-length pattern is $a_{s1} \to d_{s1} \to a_{e1} \to d_{e1}$, and no further patterns can be derived from it. This pattern describes that disease d starts after disease a starts and that disease a finishes before disease d finishes.

threshold based on the threshold value, and here the number of input values is constant at 5000.
Fig. 6 illustrates that the running time of our KPrefixspan algorithm is lower than that of the TPrefixspan algorithm. As the threshold value increases, the execution time decreases: a higher threshold leaves more patterns unsupported, so more unwanted patterns are removed at each projection and the results are obtained in a shorter time.
Fig. 7 shows that the memory usage of the proposed KPrefixspan algorithm is lower than that of the TPrefixspan algorithm, because some events are removed from the database at every projection; the need to store events decreases further as the threshold value increases.
Fig. 8 illustrates that our algorithm yields fewer patterns, while both algorithms yield more patterns at the threshold value 200. The number of patterns decreases gradually for both the KPrefixspan and TPrefixspan algorithms as the threshold value goes from 300 to 500.
In Fig. 9, the number of patterns of the KPrefixspan algorithm decreases gradually as the threshold value increases; comparing the results of KPrefixspan with those of TPrefixspan, the pattern length of TPrefixspan is smaller.

CONCLUSION
In this study, we have presented an efficient approach to discover time-interval-based sequential patterns from a temporal database. To find the sequential patterns, we have presented an efficient sequential pattern mining algorithm that is an improved version of the TPrefixspan algorithm. First, the database is converted into the refined database by eliminating from the original database the events that do not meet the threshold. Next, the input data are converted into an interval-based format, sorted by time interval, which forms the input to the proposed approach. The proposed approach is then applied, and the patterns whose support is below the threshold value are removed. Finally, we obtain the patterns as sequences based on the time intervals. The experiments were carried out on synthetic datasets; in the experimental results, the running time was reduced by almost 60% and the memory usage by almost 25% compared with the existing TPrefixspan algorithm.
interval based sequence of patient diseases are achieved with three major steps. They are:
• Making the refined database from the original database
• Constructing the disease sequences with starting time and ending time
• Mining the disease sequences using the projection method of the disease

$P_i[S(d_j)]$ = Sequences of disease

Fig. 1. Pictorial representation of the disease sequences of patient id 101

Mark the disease $a_{s1}$ in the refined database for all patients. Scan after $a_{s1}$ in each patient's sequence and place the remainder into the projected database, giving $d_{s1} \to a_{e1} \to d_{e1} \to e_{s1} \to a_{s2} \to e_{e1} \to a_{e2}$; $e_{e1} \to d_{s1} \to a_{e1} \to d_{e1} \to b_{s1} \to d_{s2} \to b_{e1} \to d_{e2}$; $b_{e1} \to e_{s1} \to a_{e1} \leftrightarrow e_{s2} \to e_{e1} \to e_{e2}$; and $a_{s1} \to e_{s1} \to b_{e1} \to d_{e1} \to a_{e1} \to e_{e1} \to d_{e1}$. Count each disease in the projected database; the counts are $a_{e1} = 4$, $d_{e1} = 3$, $b_{s1} = 1$, $a_{s2} = 1$, $a_{e2} = 1$, $d_{e2} = 1$, $e_{s1} = 3$, $b_{e1} = 3$, $e_{e1} = 4$ and $d_{s1} = 3$. Here the threshold value is 3, so the diseases whose count is below the threshold are removed from the projected database: $b_{s1}$, $a_{s2}$, $a_{e2}$ and $d_{e2}$ are removed because they do not meet the threshold value. The remaining diseases, $a_{e1}$, $d_{e1}$, $e_{s1}$, $b_{e1}$, $e_{e1}$ and $d_{s1}$, are selected for the two-length sequences. The same procedure is followed for the other one-length patterns.

The two-length patterns available from the one-length patterns are $a_{s1} \to a_{e1}$, $a_{s1} \to d_{s1}$, $a_{s1} \to d_{e1}$, $a_{s1} \to e_{e1}$, $a_{s1} \to e_{s1}$, $a_{s1} \to b_{e1}$, $a_{e1} \to d_{e1}$, $a_{e1} \to e_{e1}$, $b_{s1} \to b_{e1}$, $b_{s1} \to e_{e1}$, $b_{s1} \to a_{e1}$, $b_{s1} \to a_{s1}$, $b_{s1} \to e_{s1}$, $b_{e1} \to e_{s1}$ and $b_{e1} \to a_{e1}$. From these two-length patterns, Table 5 describes the finding of three-length patterns for $a_{s1} \to a_{e1}$, $a_{s1} \to d_{s1}$ and $a_{s1} \to d_{e1}$. In Table 5, $a_{s1} \to a_{e1}$ is one of the two-length patterns that has two extensions meeting the threshold, $d_{e1}$ and $e_{e1}$. Here too, the events that do not meet the threshold are removed from the projected database before the three-length patterns are computed.
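The projection and pruning step for a single prefix can be sketched as below. The helper names `project` and `prune` and the toy sequences are hypothetical; the sketch assumes each endpoint label occurs at most once in a patient's sequence and does not reproduce the paper's tables.

```python
from collections import Counter


def project(sequences, prefix_event):
    """Keep, for each sequence containing the prefix event, the part after it."""
    return [
        seq[seq.index(prefix_event) + 1:]
        for seq in sequences
        if prefix_event in seq
    ]


def prune(projected, threshold):
    """Remove events whose patient count in the projected database is too low."""
    support = Counter()
    for suffix in projected:
        support.update(set(suffix))  # one count per patient
    kept = {event for event, n in support.items() if n >= threshold}
    refined = [[e for e in suffix if e in kept] for suffix in projected]
    return refined, support


# Toy endpoint sequences (invented), one per patient.
sequences = [
    ["a_s1", "d_s1", "a_e1", "d_e1"],
    ["b_s1", "a_s1", "a_e1", "b_e1"],
    ["a_s1", "d_s1", "d_e1", "a_e1"],
]
projected = project(sequences, "a_s1")
refined, support = prune(projected, threshold=2)
```

In this toy run b_e1 is supported by only one patient, so it is dropped before the two-length patterns with prefix a_s1 are formed. Later scans then work on shorter suffixes.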

Fig. 2. Running time based on the number of input data

Pseudo Code
Begin
    call P_i[S(d_j)]
    for all patients p_i
        project disease
        for all d_j in each p_i
            calculate number of diseases N(d_j)
            if N(d_j) < T_h
                remove that disease d_j from p_i
            else
                go to next projection
end;

subroutine: Sequential disease P_i[S(d_j)]
    call Refined Database RDB
    for all patients p_i
        sort diseases d_j based on time
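The pseudo code can be loosely illustrated by the recursive sketch below. It is a simplified PrefixSpan-style miner with a refinement step before each projection, not the authors' KPrefixspan implementation. The function name `mine`, the endpoint labels and the toy sequences are assumptions, and the sketch does not check that a start event in a pattern is matched by its end event.

```python
from collections import Counter


def mine(sequences, threshold, prefix=()):
    """Return {pattern (tuple of events): support} for all frequent patterns."""
    support = Counter()
    for seq in sequences:
        support.update(set(seq))  # count each event once per patient
    frequent = {e for e, n in support.items() if n >= threshold}
    # Refinement: drop unsupported events before any further projection.
    refined = [[e for e in seq if e in frequent] for seq in sequences]
    patterns = {}
    for event in sorted(frequent):
        pattern = prefix + (event,)
        patterns[pattern] = support[event]
        # Project: keep what follows the event (labels are unique per sequence).
        projected = [seq[seq.index(event) + 1:] for seq in refined if event in seq]
        patterns.update(mine(projected, threshold, pattern))
    return patterns


# Toy endpoint sequences (invented), one per patient.
sequences = [
    ["a_s1", "d_s1", "a_e1", "d_e1"],
    ["b_s1", "a_s1", "a_e1", "b_e1"],
    ["a_s1", "d_s1", "d_e1", "a_e1"],
]
patterns = mine(sequences, threshold=2)
```

Because unsupported events are dropped at every level, each projected database is smaller than the one it came from.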

Making of the Refined Database from the Original Database
D denotes a temporal database with three attributes: person ID, event type and time period. For instance, some clinical records contain attributes such as patient ID, the patient's disease and the time period of each disease. An instance of D is shown in Table 1; the time period of each disease is recorded using $t_s$ and $t_e$, which are the beginning time and ending time of the disease, respectively.

Table 1. Original database D

Table 2. Refined database of the original database

The two-length sequences are generated from Table 4 above; by projecting the two-length sequence we

Table 4. Projected database and sequences of diseases for the one-length prefix
Prefix: $a_{s1}$
Projected database: $d_{s1} \to a_{e1} \to d_{e1} \to e_{s1} \to a_{s2} \to e_{e1} \to a_{e2}$
Counts: $a_{e1} = 4$, $d_{e1} = 3$
Sequential disease: $a_{s1} \to a_{e1}$, $a_{s1} \to d_{s1}$