Adjacency Matrix Deep Learning Prediction Model for Prognosis of the Next Event in a Process

Prediction of the next event is important for organizations seeking to improve and optimize their processes and achieve organizational goals. Existing predictive models are limited because they rely on discovery algorithms that may not conserve the sequences of events as reported in the event log. Discovery algorithms alter the sequence of events in two ways: either they generate additional sequences of events not found in the event log, or they remove existing ones. Since prediction relies on these process algorithms, the prediction model can suffer and produce underperforming results. Models that do not use discovery algorithms, such as deep learning models, ignore the sequence of events entirely. To overcome these limitations, we propose a new algorithm called AXDP (Adjacency Matrix Deep Learning Prediction Model). AXDP predicts the next event of a process using graph theory techniques, specifically adjacency matrices, combined with the power of deep learning models. AXDP has a major advantage: the sequence of events is conserved, resulting in better prediction of the next event. When tested on eight publicly available datasets, AXDP outperforms what we believe to be the most recent and best predictive models for next-event prediction on six of the eight datasets.


I. INTRODUCTION
Prediction of the next event is a fundamental part of an organization's effort to optimize a process, prevent undesired outcomes, and perform tasks as planned. Unfortunately, discovery algorithms have shortcomings. Existing discovery algorithms do not conserve the sequences of events in an event log: they either add nonexistent sequences of events or remove existing ones. The addition or removal of sequences of events in the event log leads to poor predictive models. AXDP (Adjacency Matrix Deep Learning Prediction Model) was developed to overcome these shortcomings and improve prediction of the next event. AXDP is a predictive model that predicts the next event of a process using adjacency matrices along with deep learning models. AXDP has a major advantage: the individual event sequence information for each trace or CASE ID in the event log is kept in the form of an adjacency matrix. In addition, for six out of eight publicly available benchmark datasets, AXDP outperforms what we believe to be the highest performing model for prediction of the next event. The structure of this paper is as follows: Section II summarizes the related work. The definitions are provided in Section III. We introduce the datasets in Section IV. Section V discusses the method, and in Section VI the experiment is introduced and evaluated. We then conclude the paper and discuss future work in Section VII.

II. RELATED WORKS
Several works have used adjacency matrices and their properties in the healthcare industry for prediction of health outcomes such as Alzheimer's [1]. For instance, Tong has shown that the maximum eigenvalue can distinguish network types that have the same mean degree and can be a good indicator for mental fatigue estimation [2]. Moreover, using adjacency matrices, Shuangzhi was able to detect early mild cognitive impairment (EMCI) and explored individual differences and information association among subjects to improve EMCI detection [1]. Another recent work is that of Mayssa, who used adjacency matrices to understand each brain region's topological role and its development in brain functional connectivity (FC) networks for early disorder detection [3]. Adjacency matrix properties such as eigenvectors and distance metrics allow mapping of large and complex datasets. Furthermore, there are numerous algorithms for prediction of the next event in a system. The state-of-the-art models for prediction of the next event can be categorized as process mining and deep learning models. Both process mining and deep learning models have one main limitation: they do not use the exact sequence of events of an event log. These models either take a filtered set of each trace or add events that were not in the original event log. The modified event logs, with added or subtracted events, can lead to poor predictions of the next event. AXDP has the advantage of conserving the sequence of events for prediction. The existing works for process mining and deep learning methods are summarized below.

A. PROCESS MODELS
Process models are used to predict the next event. Unuvar creates decision trees to predict the probability distribution of the next activity from the parallel execution paths of the executed traces [4]. Marquez-Chamorro defines an evolutionary rule-based approach that encodes the predictive features and predicts the next events [5]. Pravilovic uses the window technique of encoding the predictive features in combination with a multi-target prediction algorithm [6]. Appice compares the performance of various classification algorithms trained for predicting the next activity in combination with sophisticated process features, which are built by considering the entire trace or by using the window technique [7]. Francescomarino combined machine learning approaches for clustering and classification in order to classify new traces during their execution and predict how they will behave in the future [8]. Ferilli and Angelastro use a heuristic strategy and a Naive Bayes approach [9]. Le combines a sequential k-nearest neighbor classification with an extension of Markov models [10]. Theis developed a process mining approach called Decay Replay Mining (DREAM), which creates timed state samples using decay functions that are fed into a Neural Network (NN) [11]. Pishgar proposed a pre-processing method that concatenates events holding concurrent relations based on a probability algorithm, producing simpler and more accurate process models [12].

B. DEEP LEARNING MODELS
Besides not conserving the sequence of events, process models do not consider the diverse information observed in the event log, with the exception of the work of Theis [11]. In addition to process models, deep learning models have been shown to be successful in predicting the next event.
For example, Tax uses a long short-term memory (LSTM) approach, introducing one-hot encoding to transform input activities into feature vectors and supplementing this representation with specific features synthesized from the timestamps [13]. Evermann considers activities and resources simultaneously but ignores timestamps; he applies a recurrent neural network (RNN) after forcing the construction of a single perspective by concatenating the characteristics into a single word and then computing an embedding of this feature representation [14]. Camargo combines embeddings and LSTMs, training LSTMs to predict sequences of next events, timestamps, and their associated resource pools [15]. Pasquadibisceglie performs a feature extraction phase to extract aggregated trace characteristics that capture the frequency of activities and resources, and the time performance [16]. Bergstra transforms these trace characteristics into RGB images and trains a convolutional neural network (CNN) using the tree-structured Parzen estimator [17]. Pasquadibisceglie combined grayscale image encoding and CNNs to process activities and timestamps [18].
The most recent work, which has been shown to outperform existing next-event prediction algorithms, is MiDA (a multi-view deep learning approach) [19], proposed by Pasquadibisceglie in 2021. MiDA is a method for solving next-activity prediction using a combination of deep learning and multi-view learning. MiDA takes into account different perspectives for each trace [20]. Every categorical attribute is converted into a numerical representation using a coding function for each activity. Each value assigned to an activity is mapped to a fixed-dimensional embedding vector that corresponds to a weight matrix. This weight matrix is called an embedding matrix and contains weights that are learned during the training of the neural network. After embedding the categorical variables, the output of the embedding layers and the input of all the continuous variables are concatenated into a single vector. The output of the concatenation layer is fed into a recurrent neural network containing two stacked LSTM layers. Lastly, the LSTM module is fed into a softmax layer in order to predict the next event or activity.
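As a rough illustration of this design (not the authors' exact configuration), the following tf.keras sketch wires embeddings for two categorical views and a continuous timestamp view into two stacked LSTMs and a softmax output; all layer sizes and view names are illustrative, not the values used in [19].

```python
import tensorflow as tf
from tensorflow.keras import layers

# Schematic MiDA-style network: one embedding per categorical view,
# concatenation with continuous inputs, two stacked LSTMs, softmax.
seq_len, n_activities, n_resources = 10, 25, 15   # illustrative sizes

act_in = layers.Input(shape=(seq_len,), name="activities")
res_in = layers.Input(shape=(seq_len,), name="resources")
time_in = layers.Input(shape=(seq_len, 1), name="timestamps")  # continuous view

act_emb = layers.Embedding(n_activities, 8)(act_in)  # learned embedding matrix
res_emb = layers.Embedding(n_resources, 4)(res_in)

x = layers.Concatenate()([act_emb, res_emb, time_in])  # single vector per step
x = layers.LSTM(64, return_sequences=True)(x)          # two stacked LSTM layers
x = layers.LSTM(64)(x)
out = layers.Dense(n_activities, activation="softmax")(x)  # next activity

model = tf.keras.Model([act_in, res_in, time_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```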
To evaluate the performance of MiDA, Pasquadibisceglie and his team use 3-fold cross-validation [19]. Using a post-hoc test (the Nemenyi test), MiDA is ranked higher than the baseline models [13], [14], [15], [16], and [18] at a 0.05 significance level on the eight datasets we outline in Section IV. One of the main advantages of MiDA is that it takes into account all the data of a trace: resources, process activities, and timestamps.
Experiments with various benchmark event logs demonstrate the effectiveness of MiDA compared to several recent state-of-the-art methods [19]. Although MiDA is statistically ranked higher than the baseline models [13], [14], [15], [16], and [18], MiDA also fails to conserve the sequence of events. There is a need for an improved model that takes into account the sequence of events for each trace in an event log.
Motivated by the successful use of adjacency matrices for prediction tasks and the high performance of deep learning models, we created AXDP. For each unique trace or CASE ID, AXDP creates an adjacency matrix of consecutive events from the event log, the maximum eigenvalue of the second and third order consecutive event matrices, and the frequency of events. These vectors are fed into a deep learning model for prediction. AXDP has demonstrated statistical improvements when compared to the baseline models [19], [13], [14], [15], [16], and [18].

III. PRELIMINARIES
In this section, we introduce process mining notations which are required for understanding the proposed algorithm of this paper. The preliminaries include event, trace, event logs, consecutive events, recall, precision, F-score, AUC score and accuracy, adjacency matrices, max eigenvalue and deep learning models. The definitions of event, trace, and event log are based on [20].
Definition 1: Event. An event or activity a ∈ A is an instantaneous change of the state of a system. A system can be, for example, a business process or a hospital procedure.
Definition 2: Trace. Let 𝒜 be the universe of activities and A ⊆ 𝒜 a set of activities. A trace L is a finite sequence of activities over A, i.e., L = <e_1, e_2, ..., e_n> with e_i ∈ A for 1 ≤ i ≤ n.
Definition 3: Event Log. An event log 𝕃 is a multiset of traces over A, i.e., 𝕃 ∈ B(A*), where B(A*) denotes the set of multisets over A*. Moreover, each event has a label ℓ ∈ L that refers to a task executed within a process; we retrieve the label of an event with the function λ : A → L, using λ(e) = e_ℓ.
The definitions of consecutive events below are based on [21].
Definition 4: First Order Consecutive Events. Let 𝕃 be an event log over A. For two events a, b ∈ A: a >_L1 b iff there is a trace L = <t_1, t_2, ..., t_n> ∈ 𝕃 and an i ∈ {1, 2, ..., n − 1} such that t_i = a and t_{i+1} = b.
Definition 5: Second Order Consecutive Events. Let 𝕃 be an event log over A. For two events a, c ∈ A: a >_L2 c iff there is a trace L = <t_1, t_2, ..., t_n> ∈ 𝕃 and an i ∈ {1, 2, ..., n − 2} such that t_i = a and t_{i+2} = c.
Definition 6: Third Order Consecutive Events. Let 𝕃 be an event log over A. For two events a, d ∈ A: a >_L3 d iff there is a trace L = <t_1, t_2, ..., t_n> ∈ 𝕃 and an i ∈ {1, 2, ..., n − 3} such that t_i = a and t_{i+3} = d.
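For illustration, a short Python sketch of these relations; the consecutive_pairs helper is ours, introduced only for this example:

```python
def consecutive_pairs(trace, order=1):
    """Return the order-k consecutive event tuples of a trace,
    i.e. all (t_i, t_{i+k}) pairs (Definitions 4-6)."""
    return list(zip(trace, trace[order:]))

trace = ["a", "b", "g", "d", "c"]
print(consecutive_pairs(trace, 1))  # [('a','b'), ('b','g'), ('g','d'), ('d','c')]
print(consecutive_pairs(trace, 2))  # [('a','g'), ('b','d'), ('g','c')]
print(consecutive_pairs(trace, 3))  # [('a','d'), ('b','c')]
```

The subsequent equations for the metrics precision, recall, and F-score are based on [22].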
Definition 7: Precision and Recall.
Precision is calculated as the ratio of correctly predicted positive examples to the total number of examples predicted positive [22]. Precision answers the question: what proportion of positive identifications was actually correct?

Precision = TP / (TP + FP)

where TP are the true positives and FP are the false positives.
Recall answers the question: what proportion of actual positives was identified correctly? [23]

Recall = TP / (TP + FN)

where FN are the false negatives. Precision is appropriate when minimizing false positives, and recall is appropriate when minimizing false negatives [24].
Definition 8: F-Score. The F-score, which weights precision and recall equally, is the variant most often used when learning from imbalanced data [22]:

F-Score = 2 × (Precision × Recall) / (Precision + Recall).
Definition 9: AUC Score. The Area Under the Curve (AUC) measures the two-dimensional area underneath the entire ROC curve. The AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. The ROC curve helps determine the trade-off between the true positive rate and the false positive rate of a model across different probability thresholds [23].
Definition 10: Accuracy. Accuracy is the ratio of correct predictions to the total number of predictions made by the classifier [24].
Definition 11: Adjacency Matrix. Let G = (V, E) be a simple graph where V is the set of vertices representing the unique events in the event log, with |V| = n, and E is the set of edges representing consecutive events. The adjacency matrix A = [a_ij] is an n × n zero-one matrix with a_ij = 1 as its (i, j)th entry when events i and j are consecutive (adjacent), and a_ij = 0 when they are not adjacent [25].
Definition 12: Maximum Eigenvalue. Given an adjacency matrix A with eigenvalues λ_1, ..., λ_n (where v is the eigenvector corresponding to eigenvalue λ, Av = λv), the maximum eigenvalue is [2]

λ_max = max{|λ_1|, |λ_2|, ..., |λ_n|}.
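As a minimal illustration of Definitions 11 and 12, the following numpy sketch builds a (directed) adjacency matrix for a trace and computes its maximum eigenvalue; the adjacency_matrix helper, its order parameter, and the example trace are ours, introduced only for this example:

```python
import numpy as np

def adjacency_matrix(trace, events, order=1):
    """Build the 0-1 adjacency matrix of a trace for a given order
    of consecutive events (Definitions 4-6 and 11). Entry (i, j) is 1
    when event j follows event i at the given order in the trace."""
    index = {e: i for i, e in enumerate(events)}  # event -> row/column
    A = np.zeros((len(events), len(events)))
    for i in range(len(trace) - order):
        A[index[trace[i]], index[trace[i + order]]] = 1
    return A

def max_eigenvalue(A):
    """lambda_max = max(|lambda_1|, ..., |lambda_n|) (Definition 12)."""
    return np.abs(np.linalg.eigvals(A)).max()

events = ["a", "b", "c", "d", "g"]             # unique events in the log
trace = ["a", "b", "g", "d", "c"]              # one CASE ID's sequence
A2 = adjacency_matrix(trace, events, order=2)  # pairs like <a, g>
print(max_eigenvalue(A2))
```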

IV. DATASETS
The event logs used in this work are listed in TABLE 1. Receipt is a data set that consists of the processes of different municipalities in the Netherlands. The Receipt data set has 1,434 processes registered, with a total of 8,577 events of 27 unique activities. BPI17Offer is a data set pertaining to the loan application process of a Dutch financial institute. BPI17Offer contains 42,995 submitted applications, with 193,849 total events of 8 unique activities in the application process. The final data set used to evaluate the performance of AXDP was BPI20Request, event logs pertaining to requests for payment that are not travel-related. BPI20Request has a total of 6,886 requests registered, with a total of 36,796 events of 19 unique activities.

V. APPROACH
Process algorithms have two underlying issues that hinder the prediction capability of prediction models. First and foremost, current process algorithms overgeneralize the observed behavior in the event log [26]. Overgeneralization here means that the produced process algorithms create additional event sequences that do not exist in the original event log. Moreover, process models sometimes remove existing sequences of events. Thus, there is a need for a model that conserves the sequence of events in event logs, so as to best represent the observed behavior and predict the next event successfully. Moreover, deep learning models such as MiDA do not consider the ordering structure of the event log; they ignore the sequence of events. As a result, AXDP was created with the goal of conserving the order of events using adjacency matrices and their properties. AXDP takes an event log that contains the activities or events, timestamps, and the CASE ID. The first step is to map each trace or each CASE ID's events into an adjacency matrix. The adjacency matrix is of size n × n, where n is the total number of unique events. For example, BPI20Request contains 19 unique events, so the adjacency matrix is a 19 × 19 matrix. AXDP creates m adjacency matrices, where m is the total number of traces. The adjacency matrix assigns a one to consecutive events and a zero otherwise.
The adjacency matrix is created using the following procedure. First, AXDP pairs consecutive events to create tuples. Using a graph function, these tuples are then used to create an adjacency list for each trace. The adjacency list is then mapped to an adjacency matrix for each trace. The adjacency matrix is collapsed to a 1 × n² array in order to feed it to a neural network. The same pairing is done for second and third order consecutive events. For instance, if we have a trace with the order of events {a, b, g, d, c}, the events a and g are second order consecutive events, so we create the tuple <a, g>. For third order consecutive events, we take a and d and create the tuple <a, d>. Taking 2nd and 3rd order consecutive events into account is important; most process mining methods take into account only first order consecutive events.
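For concreteness, the following is a minimal sketch of this procedure for one trace, assuming networkx as the graph function and numpy for the matrix; the trace and event set are the illustrative ones from the text, not a real log:

```python
import networkx as nx
import numpy as np

trace = ["a", "b", "g", "d", "c"]
events = ["a", "b", "c", "d", "g"]             # all unique events in the log

# Pair consecutive events into tuples, e.g. (a, b), (b, g), ...
pairs = list(zip(trace, trace[1:]))

# A graph function turns the tuples into an adjacency structure.
G = nx.DiGraph()
G.add_nodes_from(events)                       # keep a fixed n x n shape
G.add_edges_from(pairs)

# Map the adjacency list to an n x n matrix, then collapse to 1 x n^2.
A = nx.to_numpy_array(G, nodelist=events, dtype=int)
flat = A.reshape(1, -1)                        # fed to the neural network
```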
In order to capture the entire process of a trace, we also have to look at higher orders of consecutive events, since these relationships can influence the outcome of predicting the next event. The eigenvalues of the 2nd and 3rd order adjacency matrices are calculated, and only the maximum eigenvalue of each is used as input. Furthermore, the frequency of each event in each trace is calculated. For the deep learning NN, the input is the concatenation of the flattened adjacency matrix array (m × n² across all m traces) for the first order consecutive events, the maximum eigenvalues for the 2nd and 3rd order consecutive events, and the frequency of each event. The AXDP framework is visualized in FIGURE 1.
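Putting the pieces together, the following sketch assembles the per-trace input vector, reusing the hypothetical adjacency_matrix and max_eigenvalue helpers from the Section III sketch; the resulting vector has length n² + 2 + n:

```python
from collections import Counter
import numpy as np

def trace_features(trace, events):
    """Assemble the per-trace AXDP input described above (a sketch)."""
    A1 = adjacency_matrix(trace, events, order=1)   # first order matrix
    lam2 = max_eigenvalue(adjacency_matrix(trace, events, order=2))
    lam3 = max_eigenvalue(adjacency_matrix(trace, events, order=3))
    counts = Counter(trace)
    freq = np.array([counts[e] for e in events])    # frequency of each event
    # flattened 1 x n^2 matrix, two max eigenvalues, n frequencies
    return np.concatenate([A1.reshape(-1), [lam2, lam3], freq])
```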

Algorithm 1 Adjacency Matrix Deep Learning Prediction Algorithm
Input: An event log, 𝕃
Output: Prediction of the next activity, a_{i+1}
1: for every trace L = <t_1, t_2, ..., t_n> such that L ∈ 𝕃 do
2: For every a, b ∈ A such that a >_L1 b, create a tuple <a, b>.
3: For every a, c ∈ A such that a >_L2 c, create a tuple <a, c>.
4: For every a, d ∈ A such that a >_L3 d, create a tuple <a, d>.
5: Map the first order tuples to an adjacency matrix and collapse it to a 1 × n² array, where n is the number of unique events.
6: Compute the maximum eigenvalue of the 2nd and 3rd order adjacency matrices.
7: Compute the frequency of each event in the trace.
8: end for
9: Concatenate the outputs of steps 5-7 and feed them to the deep learning model to predict the next activity, a_{i+1}.

VI. EVALUATION
In this section, we evaluate the AXDP model introduced in Section V on the eight publicly available datasets in TABLE 1 and compare its performance with MiDA and the baseline models MiDA was compared with, using the same benchmarks. We contrast our method specifically with the baseline models [19], [16], [15], [13], [14], and [18]. AXDP is trained on the training set, which is two-thirds of the data, and we evaluate AXDP's ability to predict the next event on the remaining one-third, the holdout fold. The metrics used to evaluate AXDP are the same as those used for Pasquadibisceglie's model, MiDA [19]. The AUC score is calculated to compare and assess the performance of AXDP against [19], [16], [15], [13], [14], and [18].
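A sketch of this evaluation protocol, assuming a classifier `model` that outputs class probabilities and hypothetical arrays X (per-trace AXDP feature vectors) and y (integer-encoded next-event labels); the split ratio matches the text, while the training settings are placeholders:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Two-thirds of traces for training, one-third held out, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=42)

model.fit(X_train, y_train, epochs=100, batch_size=100)  # illustrative settings
probs = model.predict(X_test)                  # class probabilities per trace
auc = roc_auc_score(y_test, probs, multi_class="ovr")  # one-vs-rest multiclass AUC
```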

A. DEEP LEARNING MODELS
This section summarizes the deep learning architectures used for all eight datasets. For all eight datasets, the architecture of the NN consisted of only three layers: the input layer, one hidden layer, and the output layer. The architecture of the NN is summarized in TABLE 3.
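The following is a schematic of this three-layer design, assuming tf.keras; the hidden width, optimizer settings, and event count are illustrative placeholders rather than the values reported in TABLE 3:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative sizes: n unique events gives an input of length n^2 + 2 + n
# (flattened adjacency matrix, two max eigenvalues, event frequencies).
n = 19
input_dim = n * n + 2 + n

# Three layers as described above: input, one hidden layer, softmax output.
model = tf.keras.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(100, activation="relu"),      # the single hidden layer
    layers.Dense(n, activation="softmax"),     # probability of each next event
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```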
B. RESULTS
AXDP has the highest observed AUC score for 75% of the datasets. A hypothesis test was conducted and confirmed that there is a statistically significant difference between the 95% AUC confidence intervals for AXDP and all the baseline models. The confidence intervals are included in TABLE 2. Furthermore, in order to test whether the confidence intervals in TABLE 2 were statistically different, the Wilcoxon test was applied. The level of significance was selected to be 0.05. In TABLE 2, we can see that the p-values were less than 0.05 for all datasets; hence, the AUC scores for all the models compared to AXDP are statistically different for all eight datasets. For the datasets BPI12Complete, BPI12W, BPI13Incident, BPI13Problem, Receipt, and BPI20Request, AXDP has the highest AUC score across all baseline models. The low p-values indicate that AXDP is significantly better at predicting the next event than all baseline models for these six datasets. However, for the dataset BPI12WComplete, [19], [16], and [13] have a higher AUC score than AXDP, although AXDP outperforms the other baseline models [15], [14], and [18]. For the dataset BPI17Offer, AXDP performs the lowest compared to all baseline models in terms of AUC.
Note that the AUC score was used as the metric for comparing AXDP with the baselines since research has indicated that the Area Under the ROC Curve (AUC) is a more appropriate performance indicator for cases involving imbalanced datasets [27]. Furthermore, AUC does not have any bias toward models that perform well on the minority class at the expense of the majority class [28]. AUC evaluates the trade-off between the false positive rate and the true positive rate across all possible cut-off threshold probabilities. Overall, since AXDP outperforms the baseline models for six out of the eight datasets in terms of AUC score, the results show that AXDP is a better model for predicting the next event.

C. ABLATION
This subsection outlines the ablation study on the neural network architecture for all eight benchmark datasets. The AUC score was used to evaluate the optimal NN architecture for all datasets. For the number of layers, adding two hidden layers produced the lowest AUC score for all eight datasets, as shown in TABLE 4 and TABLE 5. This justifies using one hidden layer in the proposed NN for all datasets.
The proposed neural network architecture is analyzed further using three tests: first, changing the number of neurons, using a step size of 20 (for each hidden layer, twenty neurons were added and subtracted); second, a stepwise increasing search for the optimal learning rate; and lastly, changing the epochs and batch size using a local search approach, starting at (50, 50) and adding 50 to both the epochs and the batch size. The results in TABLE 4 and TABLE 5 indicate that the proposed architecture, summarized in TABLE 3, is locally optimized.
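The local search can be sketched as follows; the evaluate_auc helper and the base configuration values are hypothetical stand-ins for the per-dataset settings in TABLE 3:

```python
# Sketch of the local ablation search around a base configuration.
# evaluate_auc is a hypothetical helper that builds, trains, and scores
# a one-hidden-layer network for a given configuration (see Section VI).
base = {"neurons": 100, "lr": 1e-3, "epochs": 100, "batch": 100}  # illustrative

candidates = [
    base,
    {**base, "neurons": base["neurons"] + 20},   # +/- 20 neurons per hidden layer
    {**base, "neurons": base["neurons"] - 20},
    {**base, "lr": base["lr"] * 10},             # stepwise-increased learning rate
    {**base, "epochs": base["epochs"] + 50,      # (epochs, batch) moved by +50
             "batch": base["batch"] + 50},
]

aucs = [evaluate_auc(cfg) for cfg in candidates]
print(max(zip(aucs, candidates)))   # base winning => locally optimized
```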

VII. DISCUSSION
The results summarized in TABLE 2 demonstrate that AXDP performs better at predicting the next event. The baseline models were deep learning models or process models. These models relied on a pre-processing step that did not conserve the sequence of events in the log. Some of these prediction models filtered the event log, removing some of the events; other models added event sequences that did not actually occur. The lack of conservation of the exact sequence of events is a shortcoming that impacts the predictive ability of a model. AXDP, on the other hand, maintains each sequence of events without filtering or adding nonexistent sequences of events.
Since AXDP outperforms the state-of-the-art baseline models on 75% of the datasets, it is clear that the sequence of events in the log is important to maintain when predicting the next event. This means that the accuracy and exactness of the data fed into process models and deep learning models is important.
AXDP clearly has the advantage of taking into account the sequence of events. Although AXDP does not take into account all sources of information from the event log, its performance was still better than that of the other baseline models. One explanation is that when taking the sequence of events into account in the prediction, you are looking at the history and how past events influence future events. All sources of information are valuable, but perhaps this work brings to light the importance of the sequence of events for prediction of the next event.
AXDP takes as inputs the adjacency matrix of each trace for first order consecutive events, the maximum eigenvalue for second and third order consecutive events, and, as a fourth input, the frequency of each event. AXDP was evaluated using the same event log splits for the eight publicly available datasets summarized in TABLE 1. The results showed that AXDP outperforms the baseline models for six out of the eight datasets using the AUC score, as summarized in TABLE 2. A shortcoming of the AXDP approach is that when an event log contains a large number of unique events, the computation might be slow or too complex for a computer system to evaluate. The largest event log examined in this work contains 27 unique events. The input for AXDP contains at least n² values, so the computation cost can be high if n is very large.
Furthermore, performance might be improved by examining other variables, such as the duration or time of each event. Time has been shown to be a valuable variable in works such as [11]; perhaps including decay time functions in the AXDP framework would improve the model. One of the challenges faced was the imbalance of events: some events were much more frequent than others, which can affect the performance of any model. Using the AUC score helps but does not address the issue entirely, since the AUC score evaluates performance across different thresholds, which minimizes bias but does not directly address the class imbalance issue. AXDP is better than the baseline models [19], [16], [15], [13], [14], and [18] at predicting under imbalanced classes; however, this is an area of research that needs to be explored further. To the best of our knowledge, there is no existing work on optimally dealing with class imbalance in next-event prediction. This is a promising area of research.
MARTHA RAZO (Graduate Student Member, IEEE) received the B.S. and M.S. degrees in applied mathematics from the Illinois Institute of Technology, in 2017. She is currently pursuing the Ph.D. degree in industrial engineering and operations research with the Department of Mechanical and Industrial Engineering, University of Illinois at Chicago (UIC). Her current research interests include process mining, deep learning, inventory analysis, sales forecasting, and big data.
HOUSHANG DARABI (Senior Member, IEEE) received the Ph.D. degree in industrial and systems engineering from Rutgers University, New Brunswick, NJ, USA, in 2000. He is currently a Professor with the Department of Mechanical and Industrial Engineering, University of Illinois at Chicago (UIC). His research has been supported by several agencies, such as the National Science Foundation, the National Institute of Standards and Technology, and the National Institute of Occupational Safety and Health. His current research interests include the application of data mining, process mining, and time series classification in the design and analysis of healthcare, safety, and education systems.