Conformance checking of process event streams with constraints on data retention

Conformance checking (CC) techniques in process mining determine the conformity of cases, by means of their event sequences, with respect to a business process model. Online conformance checking (OCC) techniques perform such analysis for cases in event streams. Cases in streams may essentially never conclude. Therefore, OCC techniques usually neglect the memory limitation and store all the observed cases, whether seemingly concluded or unconcluded. Such indefinite storage of cases is inconsistent with the spirit of privacy regulations, such as GDPR, which advocate the retention of minimal data for a definite period of time. Catering to the aforementioned constraints, we propose two classes of novel approaches that partially or fully forget cases but can still properly estimate the conformance of their future events. All our proposed approaches bound the number of cases in memory and forget those in excess of the defined limit on the basis of prudent forgetting criteria. One class of these proposed approaches retains a meaningful summary of the forgotten events in order to resume the CC of their cases in the future, while the other class leverages classification for this purpose. We highlight the effectiveness of all our proposed approaches compared to a state-of-the-art OCC technique lacking any forgetting mechanism through experiments using real-life as well as synthetic event data under a streaming setting. Our approaches substantially reduce the amount of data required to be retained while minimally impacting the accuracy of the conformance statistics.


Introduction
Conformance checking (CC) techniques of the process mining field [1,2] fit concluded cases to a relevant process model in order to assess their conformity and detect deviations. When dealing with evolving event streams, online conformance checking (OCC) techniques check for such conformity of in-progress cases. In general, these techniques require the complete sequence of events constituting a case so that the estimated conformance is globally optimal. This complete sequence of events requirement implies that the entire sequence of events observed over an event stream shall be retained for OCC. As a repercussion, the memory limitation concern prevailing in data streams [3] gets further complicated in OCC. On top of that, privacy regulations, such as GDPR, discourage unlimited and indefinite storage of data. Compliance with the aforementioned constraints requires that events shall be forgotten at some suitable time.
State-of-the-art OCC techniques [4,5] suggest dealing with memory limitation by retaining only a limited number of active cases. Accordingly, the least recently updated cases in excess of the defined limit are forgotten by assuming them to be most probably concluded. However, due to the sparse distribution of events within cases or the number of in-parallel running cases outweighing the specified case limit, unconcluded cases may also get forgotten. Events belonging to such unconcluded, yet forgotten cases may still be observed in the future. In the literature [6], they are referred to as orphan events. If treated indifferently, OCC techniques will consider such events to belong to newly initiated cases. Accordingly, such cases will be false-negatively declared as non-conformant to the provided process model due to their forgotten prefix. The aforementioned scenario is termed the missing-prefix problem [6].
Fig. 1 provides a simplistic overview of a prefix-alignments-based OCC setup. Events observed on the stream are appended to their respective cases residing in the memory, and these updated cases are checked for conformance with the relevant process model. In order to limit memory usage, a naïve approach forgets prefixes of cases or entire cases on the basis of aging criteria. Orphan events for such cases, for instance, the triangle-shaped case, are marked as non-conforming to the relevant process model, as depicted in its prefix alignment.
In order to effectively reduce the volume of event data required to be retained in the memory without suffering from the missing-prefix problem, we present two classes of approaches: stateful and stateless. Besides the underlying mechanism, these two classes of proposed approaches differ in the trade-off between the memory consumption and the degree of certainty over the accuracy of the conformance statistics. We instantiate the proposed approaches in the context of online prefix alignments [7], which is a streaming variant of prefix-alignments-based CC [8,9]. All these approaches retain a fixed number of cases in memory and prudently forget the cases in excess through designated forgetting criteria under an overflow situation. The stateful class of proposed approaches retains a meaningful summary when forgetting cases, which also contains their current position in the process. Contrarily, the stateless class does not retain any sort of information regarding the forgotten cases. Upon observing orphan events, the stateful approaches avoid the missing-prefix problem by resuming the prefix alignment computation from the earlier position of their cases contained in the summary. On the other hand, the stateless approach leverages machine learning classification techniques to predict the position of the forgotten cases in the process, to be used for resumption of the prefix alignment computation. The stateful and stateless terminology has been adapted from the distributed systems domain [10].

Fig. 1. A simple overview of an OCC setup. The different shapes represent distinct and probably unconcluded cases observed on the event stream. These cases are stored in memory and are subjected to CC with the reference process model. Naïvely forgetting (partial or full) prefixes of cases in order to delimit memory utilization may result in unreliable results.
We demonstrate the effectiveness of our proposed approaches through an extensive experimental evaluation with real-life and synthetic event data by emitting their events as a stream. We conclude that, with a trade-off on the accuracy of the estimated conformance, the proposed approaches significantly reduce the memory footprint and even the computational complexity of online prefix alignments. The stateful approaches presented in this paper considerably advance the work presented in [11].
The rest of this paper is organized as follows. Section 2 provides an overview of the existing relevant work. Section 3 defines and explains some key concepts necessary for elaborating our proposed approaches, which we present in Section 4. Details and findings of the experiments conducted to evaluate the proposed approaches are provided in Section 5. Finally, Section 6 concludes the paper along with some ideas for future work.

Related work
This section presents existing works that are relevant but do not entirely cater to the problem addressed in this paper. Section 2.1 presents the journey of CC in general. Section 2.2 briefly explains the adaptation of CC to streaming environments. Section 2.3 lists the works which attempted to address the memory aspect of OCC. Finally, Section 2.4 lists some works where machine learning techniques are leveraged to solve relevant process mining problems. Additionally, some machine learning applications are listed which solve problems bearing resemblance to the problem solved in this paper in a field other than process mining.

Conformance checking (CC)
CC [2] on historical data, consisting of presumed-to-be-concluded cases, has attracted a lot of attention from the process mining community. While rule-based CC techniques had already been prevailing for some time, the ''token replay'' technique [12] was pioneering in the field. It replays cases of a log on its formal reference process model to detect deviations through missing or left-over tokens. The alignments-based techniques [9], which were devised to cater to the shortcomings of the token replay technique, are considered the standard in the CC field [13]. The prefix alignments [8] variant of alignments was introduced for checking the conformance of unconcluded cases.
Decomposition [14] and later Recomposition [15] techniques were proposed to divide and then unite alignments to reduce their computational complexity. The technique presented in [16] reduced the complexity of processing large cases by casting the computation of alignments as a resolution of ILP models and accordingly decomposing it for further computational efficiency. The technique proposed in [17] lent the symbol-based insights from the field of model checking to investigate conformance with respect to large process models, where typically the conventional alignments fail. The work in [18] proposed a customized cost function for alignments that focused on maximizing the number of correct matches instead of minimizing the costs, which also resulted in a reduction of the computation time. The aforementioned maximization of synchronous moves approach was advanced in [19] through the provision of means for analyzing non-conformant behavior. The technique proposed in [20] estimated the conformance of a log by only checking the conformance of its representative subset through the edit distance function, thus reducing the computations. The work in [21] further encodes the subset behavior, referred to as proxy behavior, in a trie data structure to logarithmically reduce the search space for alignment computation. The authors in [22] highlighted the unexplored walks of the CC landscape. Almost all the previously mentioned techniques focused on improving the efficiency of CC techniques. However, their application in event streams was still hindered by virtue of their complexity and intrinsic suitability to concluded cases (except for prefix alignments), thereby necessitating the advent of stream-friendly techniques. Additionally, these techniques do not take the memory limitation aspect into consideration.

CC in event streams
As one of the seminal works in the area of OCC, the technique presented in [23,24] added non-conformant execution sequences to the legitimate sequences contained in the reference process model using regions theory [25]. The cases following these deviating paths are diagnosed as non-conformant. Online prefix alignments [7] optimized the computation-wise expensive prefix alignments by introducing a lightweight model-semantics-based prefix alignments synthesis method. The technique proposed in [26] performs exact CC in streams by incrementally computing the online prefix alignments [7]. The proposed technique incrementally extends the synchronous product net upon observing new events to search for the shortest path. The behavioral-profiles-based technique [4] determined the conformance of pairs of streaming events with reference to the behavioral patterns constituting the relevant process model. The proposed technique is computationally less expensive in comparison to alignments. Additionally, it is able to deal with a warm-start scenario, where the first observed event(s) for a case do not represent the starting position of cases as per the process model. This approach somewhat abstracts from the reference process model and the markings therein. This makes it hard to closely relate and locate cases in the reference process model, which in turn is the spirit of alignments-based CC techniques. The OCC solution proposed in [27] used Hidden Markov Models (HMM) to first locate cases in their reference process model and then assess their conformance. Recently, the approach presented in [28] works with patterns of tasks observed over a certain amount of time instead of an in-hand normative process model, which most of the time is not readily available. The OCC techniques presented in this section addressed the computational complexity of the offline techniques and their inability to deal with unconcluded cases. However, the memory limitation by and large remained unaddressed, since the majority of these techniques assume the availability of infinite memory.

Memory awareness of OCC
In contrast to the process mining field, the issues related to memory limitations and the inability to store the entire stream have been adequately addressed in data stream mining [29–32]. Nevertheless, some process mining works provide a mechanism for dealing with memory limitation. The work in [33] generically suggested maintaining an abstract intermediate representation of the stream to be used as input for various process discovery techniques. The behavioral-profiles-based OCC technique [4] limited the number of cases to be retained in memory by forgetting inactive, i.e., least recently updated, cases. Forgetting cases on inactivity criteria may lead to the missing-prefix problem in processes with a sparse distribution of events. The prefix imputation approach [6] has been proposed as a two-step approach for bounding the memory and yet avoiding the missing-prefix problem in OCC. The technique selectively forgets cases from memory and then imputes their orphan events with a prefix through guidance from the normative process model. Recently, the authors in [11] proposed a memory consumption reduction approach where only a specific number of states are retained for each case in memory. Although this approach is effective, memory consumption can still grow infinitely large at some point after observing an infinite number of cases. Our proposed approaches, catering mainly to the data retention aspect of OCC, are data-driven and address the missing-prefix problem.

Machine learning prediction techniques
In the last few years, machine learning techniques have been effectively leveraged in the process mining field. The most prominent application was in predictive process monitoring, where the target task is to predict for a case the next activity [34,35], a sequence of the future activities [36], the expected timestamp of the future activities [37], the remaining cycle time [38], or the expected final outcome [39].
Recently, solutions targeting online process prediction have emerged, with the main aim of continuously re-training a prediction model using the most recent cases. For predicting the next activity while accommodating changes in process execution over time, the authors in [36] suggested using incremental models in order to reduce the retraining cycles of models. The authors demonstrated their approach using different machine learning techniques, for instance, incremental dynamic Bayesian networks and incremental neural networks. The approach presented in [40] is another novel application of machine learning techniques in the retail processes domain, where future customer behavior was predicted in an incremental fashion.
The problem of predicting missing events in a sequence has been addressed in the literature mainly in the context of improving data quality. The technique presented in [41] used bidirectional continuous-time LSTMs to impute missing values in time-series data of point processes. In [42,43], approaches applying adjusted-weight voting and ensemble-based prediction, respectively, were presented for handling missing data. In [44], a k-nn based approach was used to infer missing values based on the distances to the class center. The very recent technique presented in [45] learned marked temporal point processes (MTPP) from continuous-time event sequences with missing events. Then, an unsupervised imputation method was trained on the learned MTPPs, which can be used to effectively impute the missing data among the observed events. The previous approaches represent a category of machine learning methods that limit their focus to predicting some missing events (either in the prefix or in the suffix) of a sequence. Different from their aim, our target is to infer the complete prefix of a trace through a prediction model. Additionally, our prediction model can benefit from the existence of a process model, which was used in the first place to decide which cases can be forgotten, as we will explain later.

Preliminaries
In this section, we define some fundamental concepts related to our proposed approaches. Before proceeding with the definitions, we present some notations to be used therein in the following paragraph. Let X denote a set and let y not be an element of X, i.e., y ∉ X, and let X y represent X ∪ {y}. We represent a sequence σ of length k as σ = ⟨σ(1), σ(2), . . ., σ(k)⟩, where for 1 ≤ i ≤ k, σ(i) represents the ith element of σ. Sequences σ′ and σ′′ can be concatenated as σ′ • σ′′. The empty sequence is represented as ϵ. X* represents the set of all possible sequences over X. Additional notations, which are mostly standard in process mining, are defined if required in their respective definitions or upon usage.
Process model. A (business) process is formally represented by a process model. While multiple representations exist for modeling a process, we consider the highly expressive Petri nets.

Definition 1.
A Petri net is represented as a tuple N = (P, T, F, λ), where P represents a finite set of places, T a finite set of transitions, and F ⊂ (P × T) ∪ (T × P) a set of flow relations between places and transitions. λ is a labeling function assigning transitions in T labels from the set of activity labels Λ ⊆ A, where A is the universe of activities. The activities in a process model N entail certain behavioral relations. For instance, if activity A in a process is always followed by activity B, then the relation is termed a sequence. Other behavioral relations include choices, concurrency, and looping. The sequence of activities executed in the context of the same process instance is referred to as a case. The stage or the position of a case with reference to its process is represented through a marking M in its process model N. A marking M is a multiset of tokens over the places P in the process model N, i.e., M : P → ℕ. In a normal course of events, every case shall start in the initial marking M i and eventually end in the final marking M f. Silent transitions τ (taus), which do not correspond to process activities, are used in Petri nets for routing purposes. A transition t having a token in each of its input places, as part of a marking M, is said to be enabled, represented as M[t⟩. An enabled transition can fire or execute and thereby consume a token from each of its input places and accordingly produce a token in each of its output places, resulting in a change of marking from M to M′, represented as M[t⟩M′. The consecutive firing of a sequence of (enabled) transitions σ ∈ T* starting in a marking M and leading to M′ is referred to as a firing or execution sequence in M and is represented as M −σ→ M′. Typically, the set of execution sequences of a process model N is finitely large in the absence of loops and infinitely large in the presence of loops. An end-to-end execution sequence σ starting in M i and ending in M f is referred to as a complete execution sequence of N, i.e., M i −σ→ M f. A marking reached from M i through an execution sequence of N is referred to as a reachable marking in N. In this work, we consider the workflow nets class of Petri nets, which are characterized by unique input and output places, and all their transitions are on a path from the input to the output place. We refer to the activities represented by the transitions consuming a token from the input place, i.e., enabled in the initial marking M i, as start-markers. Similarly, the activities represented by the transitions putting a token into the output place, i.e., resulting in the final marking M f, are referred to as end-markers.
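To make the marking semantics above concrete, the following minimal Python sketch represents transitions by their label and input/output places, markings as multisets of tokens, and implements the enabling and firing rules. The net, its places, and its transitions are made up for illustration and do not correspond to the paper's figures.

from collections import Counter

# A toy net (hypothetical names): each transition has a label (None = silent tau),
# input places, and output places.
TRANSITIONS = {
    "t1": {"label": "A", "in": ["p_in"], "out": ["p1"]},
    "t2": {"label": "B", "in": ["p1"], "out": ["p2"]},
    "t3": {"label": None, "in": ["p2"], "out": ["p_out"]},  # silent routing transition
}

M_i = Counter({"p_in": 1})   # initial marking: a multiset of tokens over places
M_f = Counter({"p_out": 1})  # final marking

def enabled(marking, t):
    # M[t> : every input place of t holds at least one token.
    return all(marking[p] >= 1 for p in TRANSITIONS[t]["in"])

def fire(marking, t):
    # M[t>M' : consume a token from each input place, produce one in each output place.
    assert enabled(marking, t), f"{t} is not enabled"
    new_marking = Counter(marking)
    for p in TRANSITIONS[t]["in"]:
        new_marking[p] -= 1
    new_marking += Counter(TRANSITIONS[t]["out"])  # also drops emptied places
    return new_marking

# <t1, t2, t3> is a complete execution sequence: it leads from M_i to M_f.
M = M_i
for t in ["t1", "t2", "t3"]:
    M = fire(M, t)
print(M == M_f)  # True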

Definition 2.
An event e minimally consists of (1) the case identifier to which the event belongs, (2) the corresponding activity name represented as # activity (e), (3) the timestamp of the execution of the activity represented as # time (e), and (4) an event identifier.
Events referring to the same activity, having the same timestamp, and belonging to the same case but bearing different event ids are contextually different and distinct. The sequence of events σ belonging to a case is referred to as its trace. |σ| denotes the length, i.e., the number of events, of a trace. Usually, events belonging to the same case are ordered either on the basis of their timestamps or event ids to generate an event log L. Furthermore, we refer to the first event e in a trace whose corresponding activity matches one of the start-markers as a start-marker event, i.e., ∃ t∈T (λ(t) = # activity (e) ∧ M i [t⟩). Similarly, an event e in a trace whose corresponding activity matches one of the end-markers is referred to as an end-marker event, i.e., ∃ t∈T (λ(t) = # activity (e) ∧ M[t⟩M f ), where M is a (reachable) marking in N.
Event stream. An event log L consists of historical or concluded cases, while the cases observed in a stream evolve and new cases continuously arrive as well.

Definition 3. Let C be the universe of case identifiers, and let A be the universe of activities. An event stream S is an infinite sequence of events over C × A, i.e., S ∈ (C × A)*. A stream event is represented as (c, a) ∈ C × A, denoting that activity a has been executed in the context of case c. We consider the position at which a stream event arrives, i.e., S(i), as its event id. Therefore, any two observed stream events are considered unique and distinct, even if their respective activities and case identifiers are the same.
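As a small illustration of Definition 3, an event stream can be treated as an iterator of (case identifier, activity) pairs whose arrival position serves as the event id. The case identifiers and activities below are made up.

from typing import Iterator, Tuple

def event_stream() -> Iterator[Tuple[str, str]]:
    # In a deployment these pairs would arrive continuously from a broker;
    # a short hard-coded burst stands in for the infinite stream S here.
    yield from [("c1", "A"), ("c2", "A"), ("c1", "B"), ("c2", "B")]

# The arrival position i serves as the event id, so two stream events with the
# same case identifier and activity are still distinct.
for i, (case, activity) in enumerate(event_stream(), start=1):
    print(f"S({i}) = ({case}, {activity})")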
Alignments. By virtue of a variety of contextual factors [1,2], activities in different cases may be executed in different and diverse orders. They may even deviate from the legitimate behavioral relations represented by the relevant process model. CC techniques compare cases with their reference process model to assess their conformance and highlight any deviations. Many techniques have been developed, but alignments have been positioned as the standard for checking the conformance of cases [13].

Definition 4. Let σ be a trace and N = (P, T, F, λ) a Petri net process model. Let M i, M f be the initial and final markings of N, respectively. Let A be the universe of activities and Λ be the set of activity labels. Let ≫ ∉ A ∪ T denote the skip symbol. A sequence γ ∈ ((A ∪ {≫}) × (T ∪ {≫}))* is an alignment of σ and N if:

1. The projection of the activity part of γ on A equals σ, i.e., (π 1 (γ)) ↓A = σ.
2. The projection of the model part of γ on T is a complete execution sequence of N, i.e., M i −(π 2 (γ)) ↓T→ M f.
3. (≫, ≫) is not a valid move in alignment γ, i.e., ∀ (a,t)∈γ (a ≠ ≫ ∨ t ≠ ≫).

Alignments γ explain the sequence of events in a case through a complete execution sequence in the reference process model [2]. A case is considered conformant or fitting if a complete execution sequence exists that entirely explains the events in its trace. Otherwise, the case is considered non-conformant: a complete execution sequence either cannot explain all of the events in its trace, or some of its elements cannot be mapped to events in the trace. The extent of the non-conformance of cases is measured through the degree of the aforementioned inexplainable events in their traces and unmappable elements of the maximally-explaining complete execution sequence. Consider the trace for Case ''1'', i.e., ⟨A, B⟩, of the event log depicted in Table 1. For this trace, ⟨(A,A), (B,B), (≫,C), (≫,E)⟩ and ⟨(A,≫), (B,≫), (≫,A), (≫,B), (≫,C), (≫,E)⟩ are two of its many possible alignments γ with the process model of Fig. 1.
These two alignments fulfill all three requirements of Definition 4. Each pair in these alignments is known as a move m; for instance, in (A, A), the second A in the pair is the label of the corresponding transition in the aligned complete execution sequence. A move is usually represented through its position in the alignment, i.e., γ(i), where i ≤ |γ|. Moves m without the skip symbol ≫ are termed synchronous moves and imply that an enabled transition with the same label as the activity represented by the event of the pair is available in the current marking. Moves m with ≫ in the trace part are referred to as model moves, while moves m with ≫ in the model part are referred to as log moves.
The latter two types of moves illustrate that the artifact with ≫ is missing an explanation for its counterpart.
As illustrated in the previous paragraph, multiple alignments of a trace may be possible. Therefore, a cost function δ assigns each move a move cost in order to rank different alignments, i.e., δ : (A ∪ {≫}) × (T ∪ {≫}) → ℝ≥0. Usually, synchronous moves and model moves with silent transitions (≫, τ) are assigned a zero cost.
Accordingly, alignments are ranked on the basis of trace fitness cost.
Definition 5. Let σ be a trace and let γ be an alignment of σ. The sum of the costs of all the individual moves of γ is referred to as its trace fitness cost, i.e., κ δ (γ) = ∑ i=1…|γ| δ(γ(i)).

Alignment computation looks for an optimal alignment γ opt, which bears the least trace fitness cost. However, multiple optimal alignments may exist for a trace. Usually, δ is a unit-cost function where synchronous moves and moves with τ in the model part are assigned a cost of 0, while log and model moves are assigned a cost of 1. Accordingly, δ is omitted from the representation of the trace fitness cost, i.e., κ(γ).
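The unit-cost function and the trace fitness cost κ(γ) of Definition 5 translate directly into code. The sketch below evaluates the two example alignments of the trace ⟨A, B⟩ given above, using ">>" for the skip symbol ≫ and "tau" for a silent transition.

SKIP = ">>"   # stands for the skip symbol

def unit_cost(move):
    # Synchronous moves and model moves on silent (tau) transitions cost 0;
    # log moves and visible model moves cost 1.
    activity, model = move
    if activity != SKIP and model != SKIP:
        return 0
    if activity == SKIP and model == "tau":
        return 0
    return 1

def fitness_cost(alignment):
    # kappa(gamma): the sum of the costs of all moves of the alignment.
    return sum(unit_cost(m) for m in alignment)

# The two alignments of the trace <A, B> from the running example.
gamma_1 = [("A", "A"), ("B", "B"), (SKIP, "C"), (SKIP, "E")]
gamma_2 = [("A", SKIP), ("B", SKIP), (SKIP, "A"), (SKIP, "B"), (SKIP, "C"), (SKIP, "E")]
print(fitness_cost(gamma_1))  # 2 -> the cheaper, hence better-ranked alignment
print(fitness_cost(gamma_2))  # 6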
Prefix alignments. Alignments assume cases to be concluded. However, we may be interested in checking the conformance of unconcluded cases as well. For this purpose, the prefix alignments variant of alignments is more appropriate.

Definition 6. Let σ be a trace and N = (P, T, F, λ) a Petri net process model. Let M i, M f be the initial and final markings of N, respectively. Let A be the universe of activities and Λ be the set of activity labels. Let ≫ ∉ A ∪ T denote the skip symbol. A sequence γ ∈ ((A ∪ {≫}) × (T ∪ {≫}))* is a prefix alignment of σ and N if requirements 1 and 3 of Definition 4 hold and the projection of the model part of γ on T is an execution sequence of N starting in M i from which the final marking M f is still reachable.

Essentially, prefix alignments γ relax the second requirement of Definition 4 of alignments by allowing the model part of γ to not necessarily be a complete firing sequence. Rather, it can be a firing sequence of M i from which the final marking M f is still reachable. The rest of the concepts, such as moves and their associated costs, are the same as in alignments. Consider again the trace for Case ''1'', i.e., ⟨A, B⟩, of the event log of Table 1. ⟨(A,A), (B,B)⟩ represents one of the possible prefix alignments for this trace with the process model of Fig. 1. Similar to conventional alignments, the trace fitness cost is used to rank prefix alignments for the purpose of identifying the optimal one, i.e., γ opt.

Online prefix alignments. Conventional alignments are computed only once for each case. For checking the conformance of cases in event streams, a (prefix) alignment needs to be computed upon observing every single stream event (c, a). (Prefix) alignments computation through (A*-based) shortest-path search in a synchronous product [46] is computationally expensive. Therefore, the approach in [7] tailored prefix alignments to be efficient in checking conformance on streaming events, referred to as online prefix alignments in this work. This approach first checks if activity a of the observed stream event (c, a) corresponds to a transition t ∈ T that is enabled in the marking M of the previously computed prefix alignment γ of c (or M i if the event is the first for the case). If such a transition exists, a synchronous move (a, t) is appended to the previously computed prefix alignment. Otherwise, a fresh optimal prefix alignment γ opt is computed for the trace of c through a shortest-path search in the synchronous product, starting from the initial marking M i. In this work, we refer to the former method of extending a previously computed prefix alignment as model-semantics-based prefix alignments and the latter method of computing a fresh prefix alignment as shortest-path-search-based prefix alignments. The latter method is computation-wise much more expensive than the former. This approach enriches moves with further information, referred to as states.

Definition 7. A state is represented as a tuple s = (m, δ, M), where m represents the move, δ the cost assigned to the move m, and M the marking reached through the sequence of the moves of all the preceding states and m.
Therefore, a prefix alignment can simply be represented as a sequence of states, which we denote as ⟨γ(i)⟩ i=1...z, where z is the number of states in γ, i.e., z = |γ|. We can represent the marking M of a state s as M(γ(i)) and its move cost as δ(γ(i)), where i is the index of the move in the prefix alignment. The state s should not be confused with that in the stateful and stateless terminology used in this work. The latter refers to the collective information regarding a sequence of states s constituting the prefix alignment γ of a forgotten prefix, for instance, the marking M reached and/or the alignment cost κ(γ). For an illustration of states, consider a trace ⟨A, E⟩ and ⟨(A,A), (E,≫)⟩ as one of its possible prefix alignments with the process model of Fig. 1. The first move, i.e., (A, A), is represented as a state as ((A, A), 0, [p 1 , p 2 ]), where (A, A) is the actual move, 0 is the cost of the move, and [p 1 , p 2 ] is the marking reached with the sequence of the prefix and current move. Similarly, the second move is represented as ((E, ≫), 1, [p 1 , p 2 ]). The fitness cost of this alignment, i.e., κ(γ), is the sum of the costs of these two individual moves, which is 1 in this case.
Online prefix alignments store the computed prefix alignments in a case-centric memory, referred to as D C. Let Γ denote the universe of possible prefix alignments; then D C : C × ℕ → Γ, where ℕ denotes the set of natural numbers. For i ≥ 1, D C (c, i) represents the currently known prefix alignment related to case c after receiving events S(1), S(2), . . ., S(i). Upon observing S(i), the most recent prefix alignment of its case, i.e., D C (c, i − 1), is retrieved to be extended for the activity in S(i). Interested readers are referred to [1,2,7] for a sound understanding of the presented and other related concepts.
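The following sketch illustrates, in simplified form, the case-centric memory D C and the two ways of updating a prefix alignment described above: the cheap model-semantics-based extension is tried first, and a recomputation is triggered otherwise. The toy net, the marking representation as a set of places (assuming a safe net), and the trivial fallback that turns unexplained events into log moves are all stand-ins and do not reproduce the actual A*-based search of [7].

D_C = {}        # case-centric memory: case id -> list of states (move, cost, marking)
SKIP = ">>"

# A toy safe net (made-up names): transition -> (label, input places, output places);
# markings are represented as sets of places.
NET = {"t1": ("A", {"p_in"}, {"p1"}), "t2": ("B", {"p1"}, {"p_out"})}
M_i = frozenset({"p_in"})

def enabled_transition(activity, marking):
    # Return a transition labelled `activity` that is enabled in `marking`, if any.
    for t, (label, pre, post) in NET.items():
        if label == activity and pre <= marking:
            return t, frozenset((marking - pre) | post)
    return None

def fresh_prefix_alignment(trace):
    # Stand-in for the expensive shortest-path-search-based recomputation of [7]:
    # here every unexplained event simply becomes a log move.
    return [((a, SKIP), 1, M_i) for a in trace]

def process_event(case, activity, traces):
    traces.setdefault(case, []).append(activity)
    states = D_C.get(case, [])
    marking = states[-1][2] if states else M_i
    hit = enabled_transition(activity, marking)
    if hit is not None:
        t, new_marking = hit        # model-semantics-based: append a synchronous move
        D_C[case] = states + [((activity, t), 0, new_marking)]
    else:                           # otherwise recompute over the full known trace
        D_C[case] = fresh_prefix_alignment(traces[case])

traces = {}
for case, activity in [("c1", "A"), ("c1", "B"), ("c2", "C")]:
    process_event(case, activity, traces)
print(D_C["c1"])   # two synchronous moves, total cost 0
print(D_C["c2"])   # one log move, cost 1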

OCC with constraints on data retention
In this section, we present our proposed approaches, which effectively reduce the memory footprint without suffering from the missing-prefix problem. The proposed approaches are classified into two classes: stateful and stateless. The approaches presented in Section 4.1 belong to the stateful class, while the approach presented in Section 4.2 is categorized as stateless. Both the stateful and the stateless approaches consist of two steps. As a first step, these approaches reduce the data to be stored in memory. For this purpose, the sequence of states in the prefix alignments of the cases present in memory is partially or fully forgotten through forgetting criteria. As a second step, the distinct underlying mechanisms of the proposed approaches enable the proper computation of prefix alignments for the orphan events of such cases.

Definition 8.
For a case c with trace ⟨e 1 , e 2 , . . ., e i ⟩, its sequence of events is referred to as orphan events if e 1 is not a start-marker event, i.e., ∄ t∈T (λ(t) = # activity (e 1 ) ∧ M i [t⟩).

Referring to the discussion on online prefix alignments in Section 3, for a stream event (c, a), the model-semantics-based prefix alignments method requires the current marking M of the prefix alignment γ of c to extend it for a. Alternatively, the shortest-path-search-based prefix alignments method recomputes γ from the initial marking M i in order for it to be globally optimal. Our proposed approaches relax the need for the global optimality of γ and exploit the aforementioned properties of online prefix alignments for the proper computation of prefix alignments for the orphan events. We adapt the case-oriented memory D C of [7] for storing the computed prefix alignments.

Carry-forward marking and cost
In this section, we first present our primary stateful approach and then its further constrained variant.

Carry-forward marking and cost with bounded cases (CFc)
In this primary approach of the stateful class of proposed approaches, we define a limit n on the maximum number of cases to be retained in D C. In other words, these n cases can retain an unlimited number of states in their prefix alignments. For all the other cases, their actual prefix alignment is forgotten. However, a single special state is retained for each forgotten case in a repository R C. This special state carries forward the summary of the forgotten prefix alignment, i.e., the marking M reached with the prefix alignment and the cumulative cost of its moves, and is referred to as a summary state. Despite forgetting cases for reducing memory consumption, our aim is to minimize the error in estimating the fitness cost of the forgotten cases. Therefore, instead of forgetting naïvely, we prudently forget the prefix alignments of cases according to designated forgetting criteria, which we explain later in this section.
Upon observing an orphan event, we retrieve the marking M of its forgotten prefix γ from the summary state in R C. Accordingly, its prefix alignment computation γ′ is resumed from marking M, instead of M i. Additionally, the cost carried forward by the summary state is added as a residual to that incurred by the orphan event(s). As a result, the effective trace fitness cost of the case is κ(γ) + κ(γ′). Through the aforementioned mechanism, we increase the probability of correctly estimating the conformance of cases even with forgotten prefixes. It may be noted that the model part of γ′ is an execution sequence in M, instead of M i. Therefore, γ′ may not necessarily be globally optimal.
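The carry-forward mechanism itself is small: a forgotten case leaves behind only the marking reached by its prefix alignment and the cost accumulated so far, and this residual cost is added to the cost of whatever prefix alignment is later computed for its orphan events. A minimal sketch, with illustrative names:

from typing import NamedTuple

class SummaryState(NamedTuple):
    marking: str   # marking M reached by the forgotten prefix alignment gamma
    cost: int      # cumulative cost kappa(gamma) of its moves

R_C = {"case-42": SummaryState(marking="[p2]", cost=1)}  # repository of summary states

def effective_fitness_cost(case, cost_of_resumed_alignment):
    # kappa(gamma) + kappa(gamma'): residual cost of the forgotten prefix plus
    # the cost of the prefix alignment gamma' computed for the orphan events.
    residual = R_C[case].cost if case in R_C else 0
    return residual + cost_of_resumed_alignment

print(effective_fitness_cost("case-42", 2))  # 3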
The proposed approach is referred to as ''Carry-forward Marking and Cost with Bounded Cases'' and is represented as CFc in the rest of this paper. The maximum data stored at any point by CFc thus amounts to the states of the n retained prefix alignments plus the |R C | summary states. Consider, for instance, cases whose prefix alignment has reached marking [p 1 ] in the process model of Fig. 2. Assume that these cases are forgotten at this stage by our forgetting approach. An orphan event C observed for such cases will not fit in the marking [p 1 ] stored in the summary state and hence will falsely be considered a log move, thereby causing the trace fitness cost to be over-estimated. As a worst-case alternative scenario of the aforementioned effect, firing transition t 7 on observing event D in marking [p 5 ] for a case will lead it to the final marking, i.e., [p 0 ], and any following event(s) will be (falsely) marked as log moves. For easy referencing in the rest of the paper, we refer to the above-illustrated effect as the multi-choice effect. As evident from the above discussion, the potential of CFc to overestimate the fitness cost of cases due to the multi-choice effect can result in classifying conformant cases as non-conformant.
Forgetting criteria. The forgetting criteria select cases to be forgotten in a way that minimizes the error in estimating the trace fitness cost of these cases. To achieve this, the forgetting of conformant cases is prioritized over non-conformant cases. As a result, non-conformant cases are retained for longer so that their computed prefix alignments are locally optimal. As an illustration, consider the example process model of Fig. 2 and a sequence of events ⟨A, F⟩ for a case c. Suppose the prefix alignments compute ⟨(A, A), (F, ≫)⟩ as the optimal prefix alignment, with a fitness cost of 1 and reached marking [p 2 ]. Assume that an event G is observed next. Having retained the previously computed prefix alignment, a fresh prefix alignment will be computed as ⟨(A, A), (≫, E), (F, F), (G, G)⟩. Consider a different scenario where the non-conformant ⟨(A, A), (F, ≫)⟩ is forgotten. In such a case, even with retaining the reached marking [p 2 ], the prefix alignment computed for G will not be optimal and will result in a wrongly estimated fitness cost for c. We refer to such forgetting of log moves, and accordingly their inability to influence future alignments, such as F in this example, as semi-premature forgetting of events.
Our forgetting criteria define a set of conditions in order of priority. The single-pass forgetting criteria assign each case in D C a forgetting priority in accordance with the condition it fulfills. Once a case with a certain forgetting priority is found, we narrow the search down to cases with a higher forgetting priority. Upon finding a case fulfilling Condition 1 below, we completely stop the search process, as we have already found the most suitable case to be forgotten. In the following, we briefly explain the conditions constituting our forgetting criteria with the help of the example prefix alignments provided in Table 2.
Algorithm 1 Online Prefix Alignments with CFc

1. This condition looks for conformant cases with a single event in their trace which is also a start-marker. The prefix alignment for the orphan events of such a case will still likely be globally optimal. Such a case, therefore, is assigned the highest forgetting priority. For instance, case ''3'' in Table 2 fulfills this condition and is hence assigned a priority of 1.
2. A case with residual cost > 0 implies that its forgotten prefix was non-conformant. Since the prefix alignment of a forgotten prefix cannot be revisited, such a case will remain non-conformant forever. It is assigned the second highest forgetting priority. Case ''2'' in Table 2 fulfills this condition and is hence assigned a priority of 2.

3. The prefix alignments of completely conformant cases are optimal. Accordingly, the prefix alignments of their orphan events are expected to remain optimal starting from their current marking. Such cases are therefore assigned the third highest forgetting preference. Case ''4'' in Table 2 fulfills this condition and is hence assigned a priority of 3.

4. Cases with zero residual cost but non-zero total fitness cost imply that not all of their in-memory events are fitting. We assign such cases the least forgetting preference. Especially cases having non-fitting events in their tail are our last choice, because of the semi-premature forgetting consideration. Cases ''1'' and ''5'' in Table 2 fulfill this condition and are hence assigned a priority of 4.
Algorithm 1 provides a summary of the CFc approach. In addition to the input requirements of [7], the algorithm requires a case limit n as input. Upon observing an event (c, a) on the stream S, the algorithm fetches any previously computed prefix alignment for the case c existing in D C in Lines 5–7. If a previously computed prefix alignment does not exist in D C, a summary state is looked for in R C in Line 9. In Line 14, if a previously computed prefix alignment is found or a summary state is retrieved, then the prefix alignment computation is resumed. Otherwise, the case is treated as a new case and a prefix alignment from M i is computed, either through the model semantics or the shortest-path search method. This latest prefix alignment is stored in D C in Line 15. If a recently observed case c overflows the case limit n, then the forgetting criteria forget a suitable case in D C in Lines 10–13. In Line 12, a summary state, consisting only of the cost incurred and the marking reached by the prefix alignment of the forgotten case, is stored in R C.
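The following Python sketch illustrates the CFc bookkeeping summarized in Algorithm 1 together with the forgetting criteria above. It is a simplified reconstruction from the prose, not the authors' implementation: the prefix-alignment update of [7] is assumed to happen elsewhere, and only its resulting move cost and reached marking are passed in.

D_C = {}           # case id -> bookkeeping about its in-memory prefix alignment
R_C = {}           # case id -> summary state (marking, residual cost) of a forgotten case
CASE_LIMIT_N = 2

def forgetting_priority(info):
    # Conditions 1-4 above: 1 = forget first, 4 = forget last.
    if info["cost"] == 0 and info["length"] == 1 and info["first_is_start_marker"]:
        return 1   # conformant case consisting of a single start-marker event
    if info["residual"] > 0:
        return 2   # its forgotten prefix was already non-conformant
    if info["cost"] == 0:
        return 3   # completely conformant so far
    return 4       # has non-fitting in-memory events (semi-premature forgetting risk)

def observe(case, move_cost, reached_marking, is_start_marker):
    # move_cost and reached_marking are assumed to come from the prefix-alignment
    # update of [7] (model semantics or shortest-path search), not repeated here.
    if case not in D_C:
        if len(D_C) >= CASE_LIMIT_N:                    # overflow: forget one case
            victim = min(D_C, key=lambda c: forgetting_priority(D_C[c]))
            gone = D_C.pop(victim)
            R_C[victim] = {"marking": gone["marking"],  # its summary state
                           "residual": gone["residual"] + gone["cost"]}
        summary = R_C.pop(case, None)                   # resume from a summary, if any
        D_C[case] = {"residual": summary["residual"] if summary else 0,
                     "cost": 0, "length": 0, "marking": None,
                     "first_is_start_marker": is_start_marker and summary is None}
    info = D_C[case]
    info["length"] += 1
    info["cost"] += move_cost
    info["marking"] = reached_marking
    return info["residual"] + info["cost"]              # effective trace fitness cost

# Three parallel cases against a case limit of two: the third arrival forces
# one of the first two (both priority 1) to be forgotten.
print(observe("c1", 0, "[p1]", True), observe("c2", 0, "[p1]", True), observe("c3", 0, "[p1]", True))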

Carry-forward marking and cost with bounded cases and states (CFcs)
The CFc approach reduces memory consumption by limiting the number of cases retained in D C. CFcs additionally bounds the number of states retained per prefix alignment in D C by a limit w. The forgetting in this approach is hence two-dimensional. As in CFc, the prefix alignments of the cases in excess of n are forgotten using the forgetting criteria, and a summary state for each forgotten case is stored in R C. Additionally, in a first-in-first-out fashion, we forget the prefix state(s) in excess of w for all the prefix alignments residing in D C. A summary of the forgotten states is prepended as a summary state to the surviving suffix states, such that the cumulative number of states is maximally w. As in CFc, the summary state carries forward the marking and cost of the forgotten prefix state(s).
Upon observing an event belonging to a γ with a prepended summary state in D C, a shortest-path-search-based prefix alignments computation (if required) computes the prefix alignment γ′ starting from the marking retained in the prepended summary state of γ. Similarly, we compute the prefix alignment γ′ of an orphan event starting from the marking M of its forgotten prefix γ, which we retrieve from its summary state stored in R C. For both the aforementioned situations, we add the cost carried forward by the summary state as a residual to the cost of γ′. Hence, the effective trace fitness cost of such a case also takes into account the cost of its forgotten prefix states. The two-dimensional bounding of D C reduces the maximum data retention at any point to n × w + |R C | states. We use the same forgetting criteria as defined for CFc.
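The per-case dimension of forgetting in CFcs can be sketched as a small function that folds the oldest states of a prefix alignment into one prepended summary state whenever the state limit w is exceeded (state layout as in Definition 7; the example alignment is made up).

STATE_LIMIT_W = 3

def bound_states(states, w=STATE_LIMIT_W):
    # Keep at most w states: fold the oldest (first-in-first-out) excess states into
    # a single prepended summary state carrying their cumulative cost and the marking
    # reached by the last of them.
    if len(states) <= w:
        return states
    excess = states[: len(states) - (w - 1)]
    summary_cost = sum(cost for _move, cost, _marking in excess)
    summary_state = (("summary", "summary"), summary_cost, excess[-1][2])
    return [summary_state] + states[len(excess):]

gamma = [(("A", "A"), 0, "[p1]"), (("X", ">>"), 1, "[p1]"),
         (("B", "B"), 0, "[p2]"), (("C", "C"), 0, "[p3]")]
print(bound_states(gamma))
# [(('summary', 'summary'), 1, '[p1]'), (('B', 'B'), 0, '[p2]'), (('C', 'C'), 0, '[p3]')]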
In general, CFcs inherits all the characteristics of CFc. However, bounding the states in CFcs may increase the error in the estimation of fitness costs in comparison to CFc. The aforementioned effect is related to the fact that a longer prefix has a higher chance of getting an (at least locally) optimal alignment in comparison to a shorter prefix.
Algorithm 2 provides a summary of CFcs. In addition to the inputs of Algorithm 1, it requires the state limit w as an additional input. Further, after the computation of the prefix alignment for a case c in Line 14, its prefix states are forgotten in order to comply with the limit w in Lines 15–17. In Lines 16–17, a summary state, consisting only of the cost of the aforementioned forgotten prefix states and the marking reached with these states, is prepended to the surviving states.

Machine learning based marking prediction (MLc)
The stateful approaches presented in the previous sections reduce the data to be retained in D C. However, with the progression of the process, the cumulative memory consumption of D C plus R C may grow infinitely large, as R C is unbounded. In this section, we present a novel two-step stateless approach to overcome the aforementioned shortcoming of the stateful approaches. Like the stateful approaches, the forgetting step of the proposed approach prudently forgets cases in excess of the defined limit n on the basis of the forgetting criteria. However, no information is stored regarding the forgotten cases, and the approach is thereby categorized as stateless. Upon observing orphan events of the forgotten cases, the prediction step predicts their parent marking through a machine learning classifier.

Definition 9. In a case with trace ⟨e 1 , e 2 , . . ., e i ⟩, for an event e k, where k ≤ i, the marking M reached by the prefix alignment γ of the sequence of its prefix events ⟨e 1 , e 2 , . . ., e k−2 , e k−1 ⟩ is termed its parent marking.
The predicted parent marking is considered as the marking reached by the events constituting the forgotten prefix of the case. Accordingly, the prefix alignment computation is resumed from the predicted parent marking. The proposed approach is referred to as ''Machine Learning based Marking Prediction'' and represented as MLc in the rest of this paper.
Typically, a process model contains multiple reachable markings. Therefore, a multiclass classifier H is suitable for the prediction of a parent marking [47]. In order to train H, relevant historical event data or data accumulated over the stream for a certain period can be used. The activities represented by events serve as the primary features for the instances x of the training data (x, y). Subsequences of size f of the events form the f features of the instances x. The parent markings of these f events, which are reachable markings in the process model N, serve as the class labels, i.e., y ∈ {M 1 , M 2 , . . ., M K }. For instance, for a case with trace ⟨e 1 , e 2 , . . ., e i ⟩, a feature size of f = 3 will result in having all the subsequences of events of size 3, i.e., ⟨e k , e k+1 , e k+2 ⟩ where 1 < k ≤ i − 2, as features of distinct instances. For each of the instances ⟨e k , e k+1 , e k+2 ⟩, the marking reached with the prefix alignment of the sequence of its prefix events, i.e., ⟨e 1 , e 2 , . . ., e k−2 , e k−1 ⟩, in the relevant process model will serve as its class. It is worth mentioning that, in addition to the activities, any other explicit or implicit information contained in events can be used for enriching the features, which will most probably improve the accuracy of the classifier. For instance, the day or part of the day at which the activities were executed, the resources executing the activities, or any additional context information of the process are suitable to be considered as features. However, in this work, we use the bare activities as features. Such a feature space might also be served with rule-based models. However, to keep the approach scalable and able to deal with diverse feature spaces, we use a machine learning model. In the prediction step, H is used to predict the parent marking ŷ, where ŷ ∈ {M 1 , M 2 , . . ., M K }, of sequences of f orphan events. The CC technique, which is online prefix alignments in our setup, uses this predicted parent marking to resume the prefix alignment computation for the case represented by the orphan events. We refer to cases with an orphan sequence of length < f as developing cases.
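The construction of the training data and the prediction step can be sketched as follows. The paper's prototype uses Weka's Random Forest; purely for illustration, the sketch substitutes scikit-learn's RandomForestClassifier, and the traces, parent markings, and feature size f are hypothetical.

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

F = 2  # feature size: length of the activity subsequences used as features

# Hypothetical historical traces and the parent marking (Definition 9) reached by
# the prefix alignment of the events preceding each subsequence of length F.
traces = [["A", "B", "C", "D"], ["A", "B", "E", "D"]]
parent_marking_of_prefix = {"A": "[p1]", "AB": "[p2]"}

X_raw, y = [], []
for trace in traces:
    for k in range(1, len(trace) - F + 1):
        window = trace[k:k + F]                     # F consecutive activities -> features
        prefix = "".join(trace[:k])                 # the events preceding the window
        X_raw.append(window)
        y.append(parent_marking_of_prefix[prefix])  # reachable marking -> class label

encoder = OneHotEncoder(handle_unknown="ignore")
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(encoder.fit_transform(X_raw), y)

# Prediction step: F orphan events have been observed for a forgotten case.
orphan_window = [["B", "C"]]
print(clf.predict(encoder.transform(orphan_window))[0])  # e.g. '[p1]'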
An adequate f size is greatly related to the characteristics of the reference process model. In a process model without any parallelism, a single place with a token determines the marking M. Unless such a process model bears label duplication, f = 1 is sufficient to predict the parent marking for orphan events. However, process models containing a parallel construct usually require multiple places with a token for specifying M. The actual number of such places depends on the number of branches in the parallel construct. In such scenarios, at least an f equal to the number of branches is recommended. Nevertheless, such an f cannot guarantee the complete determinism of M if either some of the orphan events are noisy or multiple orphan events correspond to transitions in the same branch of the construct. By virtue of the latter scenario or label duplication in the process model, it is customary that instances with the same features bear different classes, i.e., parent markings. We refer to such instances as anomalous instances in this work. It may be relevant to mention that anomalous instances can be catered for more effectively with multi-class multi-label classification. However, we restrict the scope of this work to multi-class classification.
Having sketched an adequate f, the accumulation of at least f orphan events is, however, mainly dependent on the distribution of events within cases in the stream. Having no influence over the arrival rate of cases and events, we can nevertheless select a suitable f on the basis of the characteristics of the event stream. Fig. 3 depicts the composition of a part of the BPIC'12 real event data in terms of the length of the sequences of events which belong to the same case and are consecutively observed (different from prefixes, a sequence here represents a number of consecutive events in any part of the trace). We use this part of BPIC'12 as the test set in our experiments for MLc. As can be concluded from Fig. 3, an f = 1 caters to all the observed sequence sizes, while f = 2 may lead to premature forgetting of sequences of size 1 in case of an overflow of n. The limit n also plays a role in minimizing the aforementioned undesirable effect. Increasing n can potentially facilitate cases to reach the required f without being prematurely forgotten.
The accuracy of our classification decisions is partly determined by the orphan events observed after the decision. The predicted marking is considered as the parent marking for computing the prefix alignment of the orphan events and the events observed later on. Therefore, a predicted marking shall also remain valid as parent marking for the events observed later than the f orphan events.

Algorithm 3 Online Prefix Alignments with MLc
Anomalous instances pose a challenge to the aforementioned aspect. Additionally, desire lines [1] are an important factor in determining the durability of our classification model. In simple terms, desire lines indicate that some paths in the processes are followed more frequently than others. Therefore, cases in real-life processes are skewed with respect to their traces. Consequently, as is the case with concept drift in typical machine learning problems, our classifier may need re-training in case of a shift in the desire lines. Noise can adversely impact the overall accuracy of MLc. Referring to Definition 8, a case with a swapped first event is considered as orphaned. Accordingly, it is subjected to a parent marking prediction, which may result in an imprecise estimation of its fitness costs. As another scenario, missing events can go unaccounted for in between a forgotten prefix and the subsequent orphan events. As an example, consider a case with trace ⟨A, E, G⟩. Assume that its prefix alignment with the process model in Fig. 2 is forgotten at stage ⟨(A, A), (E, E)⟩. With f = 1, MLc predicts a parent marking of [p 6 ] for the orphan event G. The event G fits in [p 6 ] and the case is declared to be conformant, whereas the trace ⟨A, E, G⟩ is missing the event F and is non-conformant. Such escape of noisy events is referred to as masking of noise in the literature [6].
Forgetting criteria. We have adapted MLc's forgetting criteria by taking into consideration the notion of case completion as well. Since MLc completely forgets cases, conformant cases that have been concluded (and are apparently redundant) are the best candidates for forgetting. Additionally, we assign a high forgetting priority to conformant cases with a predicted parent marking. The prefixes of such cases were most probably conformant and were hence forgotten.
Algorithm 3 provides a summary of MLc. Besides the input requirements of [7], this algorithm requires the case limit n and the feature size f as input. For a case c existing in D C, the prefix alignment is updated either through the model semantics or the shortest-path search method in Line 16. Alternatively, if c does not exist in D C, then the algorithm initializes M i as the initial marking if the first observed event for case c is a start-marker event, in Lines 12–13. Otherwise, the approach predicts a parent marking M to be utilized as (a proxy for) the initial marking. In either case, the algorithm computes the prefix alignment from the designated initial marking in Line 16. For the sake of simplicity, the parent marking is shown to be predicted on the basis of a single event. However, for f > 1, the approach waits for the accumulation of f orphan events and then predicts the parent marking. Eventually, the latest prefix alignment is stored in D C in Line 17. Lines 9–11 perform the housekeeping of D C and completely forget suitable cases in excess of n upon overflow.
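The sketch below illustrates the per-event decision of MLc described for Algorithm 3 under simplifying assumptions: the prefix-alignment update itself and the overflow handling (which follows the earlier CFc sketch) are omitted, and the trained classifier of the previous sketch is replaced by a hard-coded stand-in.

F = 2                  # feature size used by the classifier
D_C = {}               # case id -> its known events (stands in for the prefix alignment)
pending = {}           # buffered orphan events of developing cases
START_MARKERS = {"A"}  # activities enabled in the initial marking M_i

def predict_parent_marking(window):
    # Stand-in for the trained multiclass classifier H of the previous sketch.
    return "[p1]" if window[0] == "B" else "[p_in]"

def on_event(case, activity):
    if case in D_C:                    # known case: extend its prefix alignment
        D_C[case].append(activity)
        return f"extend {case} (model semantics or shortest-path search)"
    if activity in START_MARKERS:      # unknown case starting as expected
        D_C[case] = [activity]
        return f"start {case} from M_i"
    pending.setdefault(case, []).append(activity)   # orphan event of a forgotten case
    if len(pending[case]) < F:
        return f"{case} is a developing case ({len(pending[case])}/{F} orphan events)"
    marking = predict_parent_marking(pending[case][:F])
    D_C[case] = pending.pop(case)
    return f"resume {case} from predicted parent marking {marking}"

for case, activity in [("c1", "A"), ("c2", "B"), ("c2", "C"), ("c1", "B")]:
    print(on_event(case, activity))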

Comparison of the proposed approaches
The stateful approaches retain a summary state for the forgotten cases, and hence their memory consumption may grow unbounded at some point. Contrarily, the stateless approach does not retain any sort of information regarding the forgotten cases, and hence its memory consumption is considered to be bounded. In the case of the stateful approaches, the prefix alignment computation of the orphan events is guided by the summary stored regarding the forgotten cases. In other words, these approaches are only to a certain extent prone to the noise in the orphan events. In contrast, referring to the discussion on the impact of noise, the accuracy of the stateless approach is completely dependent on the quality of the orphan events. Further, the underlying mechanism of the stateful approaches is straightforward to implement, whereas the stateless approach requires a machine learning model, which in turn requires historical data in order to be trained. Based on the aforementioned factors, the stateful approaches are recommended in settings where the inter-arrival rate of cases is low and the dependability of the conformance statistics is a major consideration, whereas the stateless approach is more suitable for processes with a high inter-arrival rate of cases and where memory consumption needs to be strictly constrained.

Experimental evaluation
In this section, we present the results of the experimental evaluation of our proposed approaches with real-life and synthetic event data. First, we provide some details of the experiments, for instance, the environment, the event data, and the different parameters used in the evaluation. Then, we present the results of the experiments with the stateful approach discussed in Section 4.1.1 and its variant presented in Section 4.1.2. Finally, we present the results of the stateless approach discussed in Section 4.2, followed by a short discussion on the overall results. For the sake of convenience, we provide a list of the frequently used notations in Table 3.

Fig. 4. BPIC'12 and a22f0n00_ρ1 are plotted on the primary Y-axis, while the rest of the logs are plotted on the secondary Y-axis.

Experimental setup
Prototype. The proposed approaches are evaluated through a prototype implementation. It is dependent on the Online Conformance package [7], which uses the A* algorithm [48] for the shortest-path-search-based prefix alignments computation. This package requires as input a Petri net process model N, its initial marking M i, and its final marking M f. Additionally, CFc requires the case limit n, CFcs requires both w and n, and MLc requires n and f. A majority of the multiclass classifiers can serve the purpose for MLc. However, we use Weka's [49] Random Forest as the classifier for all our experiments with the stateless approach. The working mechanism, including the format of the input and output artifacts, has been detailed in Section 4.2. We evaluate all our proposed approaches through real-life as well as synthetic event data, which we elaborate in the following paragraphs. Therefore, we adopt the strategy of mimicking an event stream through emitting the events in the considered event logs on the basis of their timestamps, in order to be processed by our prototype. Alternatively, the PLG2 tool [50] can be used to generate an event stream directly.

Event data. For evaluation through real-life event data, we use the event data of the application process and its integral offer subprocess of the Business Process Intelligence Challenge (BPIC'12) [51] in all experiments. This sizeable real event data is related to loan applications made to a Dutch financial institute and contains 13 087 cases consisting of 92 093 events. The reference process model has been developed by a process modeling expert in consultation with the domain knowledge experts from the financial institute. The reasons for selecting this event data include (1) the relatively complex reference process model containing the majority of the constructs, (2) the multiple types of event noise prevailing in the data, and (3) the magnitude of the log and the number of parallel cases therein. Fig. 4 highlights the arrival rate of cases in the part of BPIC'12 which we use as the test set for experiments with MLc. After observing approximately 25% of the events, more than 700 cases are open and running in parallel. Fig. 5 provides the dotted chart of the complete BPIC'12 event log. The arrival rate of cases and the execution of certain activities in batches are the factors which contribute to the high degree of parallelism of cases. We also evaluate our proposed approaches with synthetic event data. For this purpose, we use the rich set of event logs generated in [52], which entail diverse process and log characteristics. We consider three sizes of the process alphabet, i.e., a12, a22, and a32, where a refers to activities. For each a, we consider three execution priorities of choices and parallel constructs, which we refer to as the skewness of decisions level and represent as 0, 1, and 2 in this work. In essence, the different skewness levels influence the priority of the order of execution of the branches of parallel constructs and the priority of a branch to be executed in choice constructs. 0 entails an equal priority of the candidate branches, while 1 and 2 have ratios of 0.25 : 0.75 and 0.05 : 0.95, respectively. For the three skewed versions of each of the a logs, the authors in [52] introduce five levels of noise. These noise-inducing mechanisms include deleting the head, the tail, or a random part of a trace, or swapping the position of events. We denote these five noise levels as η0, η5, η10, η20, and η50, where the number after η represents the percentage of the cases being altered. For instance, a η50 log indicates that 50% of its cases have been induced with noise. In order to analyze the impact of the parallelism of cases on our proposed approaches, we created multiple variants of each of these logs with different arrival rates of cases and of events within these cases, referred to as ρ. Fig. 4 illustrates ρ for the five variants of the a22f0n00 event log. There, ρ1 represents the maximum level of parallelism, where all the cases in the log are running in parallel, whereas ρ5 represents the absence of parallelism, i.e., a new case is initialized only when the preceding case is concluded. We retain the same distribution of cases and events within the five noise levels of each skewness level of a. However, we present the results of only five selected variants due to space limitation and comprehensibility. Each of these individual logs and different subsets of logs assesses and uncovers different aspects of our proposed approaches. For a thorough understanding of the aforementioned process characteristics and details of the log generation, we refer the readers to [52]. Table 4 provides a summary of the event data used in this experimental evaluation.
In experiments with the stateful approaches, we realize an event stream by dispatching all the events in the log based on their actual timestamps. However, we adopt different strategies for the real and the synthetic event data in the experiments with the stateless MLc. With the real data, as suggested in [53], we consider only the concluded cases and hence are left with 12 688 cases in total. We temporally split the data into training and test sets such that the timestamp of the latest event in the training data is earlier than that of the earliest event in the test data, the configuration depicted in Fig. 5. As evident, the blue-colored events of the training data are earlier in time than the magenta-colored events constituting the test data. For this purpose, temporally overlapping cases, shown with green-colored events, are filtered out. This arrangement is necessary to avoid leakage of any future information into the training data [53]. As a result, the training set consists of 6029 cases (approx. 51% of the total cases), while the test set contains 5714 cases (approx. 49% of the total cases). With the synthetic data, we perform 10-fold cross-validation and, as such, do not restrict the timestamps of the training and the test sets as we did with BPIC'12. This is valid since the data here is synthetically generated and doing so does not leak information from the future. We emit the events of the 100 cases in the test folds based on their simulated timestamps, thereby ensuring that the case and event distributions of the data are preserved and that the number of cases running in parallel surpasses the limit n.
Evaluation metrics. We evaluate our results using four statistics: (i) the percentage reduction in the maximum number of states stored in memory in comparison to the offline approach, (ii) the root mean square error (RMSE) of the fitness cost when compared to the cost calculated using the offline approach, (iii) the F1 for the classification of cases as conformant or non-conformant, and (iv) the percentage reduction in the average processing time per event (APTE) in comparison to the offline approach. For calculating the trace fitness cost, all our experiments use the default unit cost of 1 for both log and model moves, while synchronous moves and model moves on a silent transition τ incur a cost of 0.
F1 is computed on the basis of the binary classification of cases as conformant or non-conformant. Cases with a trace fitness cost of 0 are considered conformant and those with a non-zero fitness cost non-conformant. Taking the classification by the offline approach as ground truth, F1 is computed from the correct classification of cases as conformant and non-conformant by our proposed approaches. Note that this case-level F1 is not related to the classifier used in MLc. As mentioned earlier, RMSE and F1 are calculated in reference to an offline approach, referred to as the baseline. The baseline stores the whole event history of a case, and hence its computed prefix alignments are always globally optimal. To put the reported RMSE of our proposed approaches into context, we also report the average fitness cost over the whole log as computed by the baseline. Note that in all the experiments with MLc, for the purpose of RMSE calculation, we consider the fitness cost estimated for the latest sequence of events of a case as its estimated trace fitness cost. Furthermore, in the experiments with synthetic data for MLc, we calculate RMSE over the different folds and then aggregate it. For computing APTE, we replicate the event stream 50 times, renaming the cases in each replication. Hence, we effectively process 50 times the cases in the event logs, and 50 times the events therein, thus mimicking a much larger event stream. We report APTE as the mean value over these 50 iterations.
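For concreteness, the following plain-Python sketch shows how RMSE and the case-level F1 can be derived from per-case fitness costs of the baseline and of a proposed approach; the data structures are illustrative, not taken from our prototype, and here the non-conformant class is treated as the positive class.

```python
import math

def evaluation_metrics(baseline_costs, estimated_costs):
    """Compute RMSE of fitness costs and case-level F1.

    baseline_costs, estimated_costs: dicts mapping case id -> trace fitness cost.
    A case is conformant iff its fitness cost is 0; the baseline serves as ground truth.
    """
    cases = baseline_costs.keys()
    # RMSE of the estimated trace fitness cost with respect to the baseline cost.
    rmse = math.sqrt(
        sum((estimated_costs[c] - baseline_costs[c]) ** 2 for c in cases) / len(cases)
    )

    # Case-level binary classification: non-conformant (cost > 0) as the positive class.
    tp = sum(1 for c in cases if baseline_costs[c] > 0 and estimated_costs[c] > 0)
    fp = sum(1 for c in cases if baseline_costs[c] == 0 and estimated_costs[c] > 0)
    fn = sum(1 for c in cases if baseline_costs[c] > 0 and estimated_costs[c] == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return rmse, f1
```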
Results of the CFcs and MLc experiments are presented as heatmaps, while results of the CFc experiments are presented as bar charts. The colors of the bars are consistent with the color scheme of the heatmaps, i.e., the best values are represented with brighter colors and the worst values with darker colors. We use the standard scales of [0, 100] and [0, 1] for the percentage reduction and F1 metrics, respectively. Lacking a standard scale, RMSE is scaled to the best and worst values observed over all the experiments, i.e., [0, 9.3]. Accordingly, the RMSE color semantics are consistent throughout the paper. However, the Y-axis scale of the bar charts has been adjusted in order to increase the visibility of the trends. Furthermore, the actual values of the metrics are selectively plotted on the heatmaps to complement the color scheme.
The baseline approach sets both w and n to ∞, i.e., it has unbounded memory. Considering the article's space limitations, and to keep the results concise and comprehensible, we present the results selectively. A complete and extensive version of the results is provided in Appendix A.

Effectiveness of the cost-driven forgetting criteria
We present a comparative analysis of our fitness-cost-driven forgetting criteria against the inactivity-based forgetting criteria prevalent in the literature. We performed these experiments with BPIC'12. Referring to the results provided in Table 5, the inactivity-based forgetting criteria outperformed our fitness-cost-based forgetting criteria in only 8 out of the 63 performed experiments. We therefore proceed with our fitness-cost-based forgetting criteria in the following experiments. It should, however, be noted that forgetting criteria tailored to the relevant process may prove more effective. For instance, consider processes with bounds on the case duration, say imposed by service level agreements. For such processes, the age or idleness of a case may be a good indicator of its conclusion, which makes such cases good candidates for forgetting.
From an implementation and performance perspective, our current prototype traverses D_C according to the forgetting criteria every time forgetting is necessitated, i.e., whenever n overflows. However, the frequency and extent of these D_C scans can be optimized. For instance, D_C can be implemented as a priority queue in which the retained cases are prioritized according to the forgetting criteria.
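A possible realization of this optimization is sketched below in Python. The class and attribute names are assumptions, and the ordering assumes, purely for illustration, that seemingly conformant cases (fitness cost 0) and, among equal costs, the least recently updated cases are preferred for forgetting; the actual criteria are those defined earlier in the paper.

```python
import heapq
from itertools import count

class CaseStore:
    """Illustrative priority-queue variant of D_C (names and details are assumptions)."""

    def __init__(self, case_limit):
        self.case_limit = case_limit
        self.cases = {}      # case id -> (fitness_cost, tick, prefix_alignment)
        self._heap = []      # (fitness_cost, tick, case_id), lowest cost and oldest first
        self._tick = count()

    def update(self, case_id, fitness_cost, prefix_alignment):
        """Insert or refresh a case and forget one case if the limit n is exceeded."""
        tick = next(self._tick)
        self.cases[case_id] = (fitness_cost, tick, prefix_alignment)
        heapq.heappush(self._heap, (fitness_cost, tick, case_id))
        if len(self.cases) > self.case_limit:
            self._forget_one()

    def _forget_one(self):
        """Pop the case ranked first by the forgetting criteria, skipping stale entries."""
        while self._heap:
            cost, tick, case_id = heapq.heappop(self._heap)
            entry = self.cases.get(case_id)
            # Entries made stale by a later update of the same case are simply skipped.
            if entry is not None and entry[1] == tick:
                return self.cases.pop(case_id)
        return None
```

The lazy-deletion scheme avoids re-heapifying on every event, so each update costs only a logarithmic heap push instead of a full scan of D_C.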

Results for CFc
The results on states consumption when experimenting with BPIC'12 are depicted in Fig. 6a. CFc is highly frugal in storing data for all case limits n in comparison to the baseline. As expected, this consumption increases with an increase in n. As depicted in the dotted chart of BPIC'12 in Fig. 5, the cases are significantly parallel and activities are mostly executed in batches. Therefore, prefix alignments are continuously forgotten for all our case limits, and they do not grow significantly in terms of the number of states. Consequently, the states consumption exhibits a steady increase with an increase in n.
The accuracy of the estimated fitness cost for BPIC'12 is depicted in Fig. 6b. Even when retaining only 50 cases, we observe a nominal RMSE (<1.0) with regard to the 0.5 average trace fitness cost of the baseline. Although not proportionally, RMSE decreases with an increase in the case limit n, which seems logical in light of the parallelism of cases and the batching of activities. An investigation of the RMSE revealed the multi-choice effect as the major contributing factor, as the reference process model of BPIC'12 contains multiple sources of this effect.
In Fig. 6c, F1 for all values of n is 1.0. This perfect F1 implies that CFc is accurate in the binary classification of cases in BPIC'12, i.e., whether they are conformant or non-conformant. However, recall that CFc has the potential to classify conformant cases as non-conformant. Regarding APTE, we discussed that the state size of the cases in D_C does not grow significantly due to the continuous forgetting of cases. Hence, even the shortest-path-search-based prefix alignment method performs fewer computations in comparison to the baseline. Therefore, as evident in Fig. 6d, all case limits n also perform better than the baseline on the APTE aspect. Hence, CFc is computation-wise stream-friendly. The maximum processing time observed for any single event across all limits n is in the 15-24 ms range. Therefore, CFc can deal with an event stream having an interval of around 24 ms between any two consecutive events.
For the states consumption with the synthetic event data, representative results of a single skewness level of a22 are provided in Fig. 7. We observe a stark reduction in the states retained by CFc, irrespective of the noise and skewness. In Fig. 8, RMSE generally decreases with a decrease in the number of parallel cases and also with an increase in the case limit n. The noise η also has a direct relation with RMSE. In the absence of any noise, i.e., η = 0, CFc exactly estimates the true trace fitness cost. However, noisy events may adversely affect the accuracy of our approach. For instance, when none of the n cases is conformant, the forgetting criteria forget non-conformant cases. For non-conformant cases forgotten with log moves exactly at the end of their prefix alignment, this semi-premature forgetting results in overestimating the fitness costs. We observe that some such cases are forgotten more frequently than others, and hence the majority of their observed events do not fit. Referring to the complete results provided in Figs. A.8 to A.10 in the Appendix, the process alphabet, i.e., the number of activities in the process model, does not affect the accuracy of CFc when the cases are noise-free. However, RMSE increases with an increase in the process alphabet for cases with noisy events. This behavior is justified, as an increase in the alphabet size usually results in an increase in the average number of events per trace. Accordingly, the extent of the induced noise also increases, while our case limits n and the case/event distributions are the same for all these experiments. The impact of the skewness is less pronounced, although the results might portray a rising trend of RMSE with an increase in the skewness level for a12, and a decreasing RMSE with an increase in the skewness level for a22 and a32. In essence, the total number of events increases with increasing skewness in a12, while it decreases with increasing skewness in a22 and a32. Furthermore, the type and proportion of the different noises that the noise induction mechanism of [52] generates in these event logs, and the behavior of our approach toward different noise types, contribute to these trends. We observe an F1 of 1.0 for all values of n.

Results for CFcs
The results on states consumption when experimenting with BPIC'12 are depicted in Fig. 9a. CFcs is extremely light on memory for all combinations of state and case limits in comparison to the baseline. The states consumption depends on the inter-arrival rate of events and cases, as well as on the trace length. Therefore, for some values of w, the maximum states consumption (at any point) does not reach the maximum possible limit of D_C, i.e., n × w. Referring to Fig. 9b, the two-dimensional bounding causes a slight increase in RMSE with respect to CFc. RMSE for state limits 2 to 4 does not differ much, which implies that, in relation to the type and length of the noise, these state limits result in almost the same sub-optimal prefix alignments. Starting with w = 5, the approach can optimally revisit the prefix alignments, resulting in a comparatively lower RMSE. Referring to Fig. 9c, we observe an F1 of 1.0 for all the w and n combinations for BPIC'12.
However, as thoroughly discussed for the CFc results, CFcs can also potentially classify conformant cases as non-conformant. The APTE results are depicted in Fig. 9d. Similar to the experiments with CFc in Section 5.3, CFcs also performs far better for all w and n combinations. The maximum observed time required for processing any single event across all w and n combinations is in the 14-32 ms range. Hence, CFcs can deal with event streams having an interval of around 32 ms between any two consecutive events.
For the synthetic data, referring to the representative RMSE results provided in Fig. 10, our findings are in harmony with those for BPIC'12. The impact of the various factors, such as noise, skewness, parallelism of cases, and process alphabet, is in line with the findings of the experiments with CFc.

Results for MLc
The results on states consumption for the experiments with BPIC'12 are depicted in Fig. 11a. As evident, the states consumption of MLc is significantly less than that of the baseline. The estimated fitness costs are provided in Fig. 11b. As anticipated, an increase in n causes RMSE to decrease for a given feature size f. Similarly, RMSE decreases as f increases from 1 to 2. However, RMSE is higher for f = 3 than for f = 2. This is interesting, as a larger feature size should ideally result in more accurate predictions. Two factors contribute to this peculiar behavior. First, with a larger f, developing cases need to wait longer to reach the size f. For f > 1, we consider the orphan events of the developing cases as log moves. In the absence of a conformant case, the forgetting criteria may recommend forgetting the developing cases. This premature forgetting of developing cases contributes to a wrong estimation of their fitness costs. The second potential source of this peculiar behavior is events that are disoriented or misconceived to be end-markers. For f > 1, if an end-marker event is observed for a developing case even before it reaches f events, then a parent marking is predicted. Consequently, the parent marking predicted from a swapped end-marker event may not fit some of these fewer-than-f orphan events. Similarly, in process models having duplicate labels, some events are both intermediate and end-markers. Such event labels are (wrongly) conceived to be end-markers of their relevant cases by MLc, and hence the events observed afterward do not fit. The severity of these effects increases with an increase in f. By virtue of all the aforementioned factors and the masking of the noise effect, we observe an F1 of less than 1 in the results in Fig. 11c. Referring to Fig. 11d, despite the prediction overhead, we observe a significant APTE reduction in comparison to the baseline. The maximum processing time for any single event across all f and n combinations consistently lies in the 16-19 ms range. Therefore, MLc can deal with an event stream having an interval of around 19 ms between any two consecutive events.
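To make the role of the feature size f more tangible, the following Python sketch trains a multiclass predictor of the parent marking from the labels of the first f orphan events of a forgotten case, in the spirit of Section 4.2. It uses scikit-learn's random forest as a stand-in for the Weka classifier of our prototype, and it assumes that training pairs of orphan-event labels and their parent markings are available; all names and encodings are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline

def train_marking_predictor(training_instances, f):
    """Train a multiclass predictor of the parent marking.

    training_instances: iterable of (event_labels, parent_marking) pairs, where
    event_labels is a sequence of at least f activity labels and parent_marking
    is an encoding of the marking reached before those events were executed.
    Only the first f labels are used as features, mirroring the feature size f of MLc.
    """
    X = [list(labels[:f]) for labels, _ in training_instances]
    y = [marking for _, marking in training_instances]
    model = make_pipeline(
        OneHotEncoder(handle_unknown='ignore'),      # categorical encoding of the f labels
        RandomForestClassifier(n_estimators=100, random_state=0),
    )
    model.fit(X, y)
    return model

# Illustrative use: once a forgotten case has accumulated f orphan events, the
# predicted parent marking is used to resume its prefix alignment.
# marking = model.predict([orphan_labels[:f]])[0]
```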
Interestingly, we observed an anomalous yet not disadvantageous effect. As discussed in Section 4.2, anomalous instances usually result in misclassification. However, in this specific process, more than one marking of the anomalous instances is legitimate from a conformance point of view. In other words, these markings are behaviorally equivalent. For instance, the sequence of transitions corresponding to the sequence of events ⟨O_Selected, O_Created⟩ (cf. Fig. A.21 in the Appendix) can be fired from any marking with a token in place P_1 or with a token each in places P_8 and P_9. Hence, any two markings which differ only by having a token in place P_1 in lieu of places P_8 and P_9, or vice versa, are behaviorally equivalent and do not result in any difference in the fitness costs. All the insights gained through BPIC'12 hold for the experiments with the synthetic data as well. As illustrated in the representative results in Fig. 12, RMSE usually decreases with a decrease in the parallelism of cases ρ and with an increase in n.
Building on the insights gained on the over- and underestimation of fitness costs in the BPIC'12 results, the impact of noise on MLc mostly depends on the location and the type of the noise. The fitness costs of cases with a missing head are most likely to be wrongly estimated, as MLc considers such cases to be orphaned. The costs for cases with swapped events and those with missing events depend on the position at which they are forgotten. For instance, if noisy events get masked between the forgotten prefix and the subsequent orphan events of a case, then the fitness cost of that case may be underestimated. Due to these effects of noise, as illustrated in the representative results in Fig. 12, RMSE usually increases with an increase in the noise level η of the event logs. By virtue of the aforementioned under- as well as overestimation of the fitness costs of cases, we usually observe a lower F1 for MLc compared to the stateful approaches, as illustrated in Fig. 13.
Referring to Figs. A.26 to A.28, RMSE appears to increase with an increase in the process alphabet, i.e., from a12 to a22 to a32.
We consider the same n and f parameters for all the experiments regardless of the log and process characteristics, whereas a larger process alphabet results in longer traces and relatively more induced noise. Additionally, the process models of a12, a22, and a32 require 2, 4, and 5 tokens in their parallel branches, respectively. Hence, any of the f values considered in this work yields more accurate results for a12 than for a22, and for a22 than for a32. The impact of the skewness of decisions on the accuracy of MLc is also interesting. Referring to Fig. 12 and Fig. A.28, the higher skewness levels in a22 and a32 favor the branches with little or no parallelism, and hence the accuracy of MLc increases. In a12, in contrast, the average number of events per case increases with increasing skewness, whereas it decreases with increasing skewness in a22 and a32; the average number of events per trace has a direct relation with the induced noise.

Discussion
Our primary stateful approach, CFc, is a sound technique that is robust to skewness, to the parallelism of cases, and, to a certain extent, to noise in the data. Even with a limited case retention capacity, the approach can accurately detect non-conformant cases. For cases with noise, the approach may, however, suffer from a tolerable loss of accuracy in estimating fitness costs. While all the aforementioned data characteristics are beyond our control, we can optimize the case limit n in order to minimize the error in estimating the fitness costs. Additionally, the forgetting criteria can be adjusted to improve the accuracy of CFc. The stateful CFcs inherits all the characteristics of the parent CFc approach. However, with a suitable state limit w, CFcs avoids the unnecessary retention of data, thereby, for instance, being more privacy-friendly. A suitable state limit w is mainly related to the number of events sufficient to revisit a prefix alignment for optimality [11].
The MLc approach is relatively more sensitive to the characteristics of the process model and the event data. Having no influence over these characteristics, we can, however, optimize and tune certain parameters of MLc. For instance, the accuracy of the prediction model over the different classes can be taken into consideration in the conditions of the forgetting criteria. In essence, we would ideally forget cases at specific reached markings for which the probable sequence of orphan events results in a more accurate prediction of the parent marking. The anomalous instances problem, caused by label duplication and parallelism in the process model, can be avoided through an adequate feature size f and diversity of these f features. For the latter, we can incorporate context information as features in addition to the event labels, which is expected to have a positive impact on the accuracy of the predictions. Nevertheless, an f > 1 bears other intricacies which cannot be completely controlled. Though all the proposed approaches serve the same purpose, by virtue of storing information regarding the forgotten events and cases, the approaches of the stateful class are relatively more reliable than the stateless approach, as the latter relies on estimation. On the other hand, the stateless approach is very light on memory in comparison to the stateful approaches. However, the machine learning classification model of the stateless approach may require maintenance, for instance retraining in case of a concept drift. The stateful approaches are more suitable for processes with long and less noisy traces. The stateless MLc, in turn, is useful if the data retention policies are stringent, the reference process model is sequential or lucent [54], or the arrival rate of cases is high.

Conclusion and future work
We presented two classes of approaches to avoid the extensive retention of data in an event stream setting. The proposed approaches retain minimal data and, at the same time, avoid the missing-prefix problem. We instantiated our proposed approaches in prefix-alignment-based OCC. The effectiveness of these approaches is established through experiments with real-life and synthetic event data with diverse characteristics, mimicked as event streams. In addition to the reduction in data retention, the proposed approaches positively impact the overall processing time of cases. These approaches are equally applicable to general (prefix) alignments and can easily be extended to other CC techniques. The stateless approach can also be adapted for the imputation of incomplete cases in offline environments. Such imputation will increase the quality and utility of the data for other mining tasks.
Based on the observations and findings of the conducted experiments, we identify some areas for enhancement and future work. Although only in process models bearing specific characteristics, our stateful approaches suffer from the multi-choice effect caused by anomalous execution sequences. A technique for revisiting past alignment decisions can mitigate this problem. Additionally, the durability of the machine learning classification model in MLc can be enhanced by enabling it to learn over the course of the process and accommodate changes such as concept drift. R_C in the stateful approaches is unbounded and can grow infinitely large as the process progresses. In order to devise a completely bounded stateful approach, a mechanism to forget summary states from R_C needs to be devised for CFcs. Similarly, non-conformant cases can survive indefinitely in MLc. Therefore, a stochastic element needs to be introduced into the forgetting criteria such that the probability of any case in D_C surviving indefinitely approaches zero. Our proposed approaches also do not take stream imperfections, such as the out-of-order arrival of events, into consideration. Mechanisms for dealing with stream imperfections along the lines of [55,56] need to be devised. Finally, framing the parent marking prediction in MLc as a multi-class, multi-label problem can potentially address the anomalous instances problem.
Here, ℘ denotes the average number of states over the n prefix alignments in D_C. Certain properties of process models, such as label duplication, may adversely affect the accuracy of CFc. For illustration purposes, consider the three execution sequences ⟨A, B, C, D⟩ of the example process model in Fig. 2. These identical execution sequences lead to different reachable markings, i.e., [p_0], [p_1], and [p_3]. This somewhat anomalous behavior is caused by Transitions t_7, t_9, and t_10, as they share the label D and are in a choice relation. In marking [p_5], for an observed event D, the prefix alignment has to make a choice among the aforementioned three candidate transitions. Suppose that the online prefix alignment approach, lacking any information about future events, always fires Transition t_9, resulting in marking [p_1].

Fig. 3. Distribution of sequences of different lengths in the test set of BPIC'12. Different to prefixes, a sequence here represents the number of consecutive events in any part of the trace.

Fig. 4. Parallelism of cases in the test set of BPIC'12 real event data and five variants of a22 synthetic data. The a22 variants differ in the parallelism of cases, such that ρ1 represents the maximum parallelism of cases, where all the cases in the event log are running in parallel, whereas ρ5 represents cases with no parallelism.

Fig. 5. Train-test splitting of BPIC'12 event data to avoid future information leakage. The blue-colored dots represent the events of the cases included in the train set, while the magenta-colored dots represent those included in the test set. The green-colored dots represent the events of the cases which overlap between these two sets and are filtered out. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

Fig. 6. Results of BPIC'12 with CFc. A dark color represents the worst value of the respective metric, while brighter colors encapsulate its best values. The number on the secondary Y-axis of (a) is the maximum states consumed, of (b) the avg. fitness cost over the log, and of (d) the APTE by the baseline (BL) in milliseconds.

Fig. 7. Memory footprint for a22 event logs for decision skewness level 0 and different noise levels with CFc. A dark color represents the worst value of the respective metric, while brighter colors encapsulate its best values. The number on the secondary Y-axis is the maximum states consumed by the baseline (BL).

Fig. 8. RMSE for a22 event logs for decision skewness level 0 and different noise levels with CFc. A dark color represents the worst value of the respective metric, while brighter colors encapsulate its best values. The number on the secondary Y-axis is the avg. trace fitness cost over the log by the baseline (BL).

Fig. 9. Results of BPIC'12 with CFcs as a heatmap. A dark color represents the worst value of the respective metric, while brighter colors encapsulate its best values. The number on the secondary Y-axis of (a) is the maximum states consumed, of (b) the avg. fitness cost over the log, and of (d) the APTE by the baseline (BL) in milliseconds.

Fig. 10. RMSE for a22 event logs for decision skewness level 0 and different noise levels with CFcs as a heatmap. A dark color represents the worst value of the respective metric, while brighter colors encapsulate its best values. The number on the secondary Y-axis is the avg. trace fitness cost over the log by the baseline (BL).

Fig. 11. Results of BPIC'12 with MLc as a heatmap. A dark color represents the worst value of the respective metric, while brighter colors encapsulate its best values. The number on the secondary Y-axis of (a) is the maximum states consumed, of (b) the avg. fitness cost over the log, and of (d) the APTE by the baseline (BL) in milliseconds.

Fig. 12. RMSE for a22 event logs with different decision skewness and noise levels with MLc as a heatmap. A dark color represents the worst value of the respective metric, while brighter colors encapsulate its best values. The number on the secondary Y-axis is the avg. trace fitness cost over the log by the baseline (BL).

Fig. 13. F1 for a22 event logs with different decision skewness and noise levels with MLc as a heatmap. A dark color represents the worst value of the respective metric, while brighter colors encapsulate its best values.

Table 1. An example event log of the process model in Fig. 1.

Table 2. An example snapshot of D_C.

Table 3. List of the notations used in this work.

Table 4. Details of the event data used in the experimental evaluation.

Table 5. RMSE comparison of our fitness-cost-based forgetting criteria (C) with inactivity-based forgetting criteria (I).