Learning process modeling phases from modeling interactions and eye tracking data

The creation of a process model is a process consisting of five distinct phases, i.e., problem understanding, method finding, modeling, reconciliation, and validation. To enable a fine-grained analysis of process model creation based on phases or the development of phase-specific modeling support, an automatic approach to detect phases is needed. While approaches exist to automatically detect modeling and reconciliation phases based on user interactions, the detection of phases without user interactions (i.e., problem understanding, method finding, and validation) is still a problem. Exploiting a combination of user interactions and eye tracking data, this paper presents a two-step approach that is able to automatically detect the sequence of phases a modeler is engaged in during model creation. The evaluation of our approach shows promising results both in terms of quality as well as computation time demonstrating its feasibility.


Introduction
Process models play an important role in facilitating communication between different stakeholders and in documenting the organization's business processes. They are used for redesigning business processes as well as for automating them [1]. Process model development is an iterative and collaborative process which involves many stakeholders like domain specialists and system analysts [2]. It consists of two phases: elicitation and formalization. In elicitation phases, statements about the domain are generated and validated. In formalization phases, the extracted information is then used to create a formal process model [3].
The elicitation phase requires good communication between stakeholders and has been described in the literature as a negotiation process [4] in which different modeling alternatives are discussed. The formalization phase of a process model, also denoted as process of process modeling (PPM), in turn, can be characterized as a cognitive design activity during which a designer creates a formal process model from informal requirements descriptions [5]. This requires the designer to create a mental model of the domain and to externalize it as a formal process model [6]. During process model formalization the designer engages with the modeling platform (that provides a modeling notation as well as associated tool support) to improve the process being modeled. Research on model formalization has resulted in a description of the PPM and the identification of five distinct phases (i.e., problem understanding, method finding, modeling, reconciliation, and validation) [7]. The formalization of a process model is a flexible process during which the different phases are iteratively executed. A single instance of the PPM is called a modeling session.
Since each phase is characterized by different underlying cognitive processes, the factors determining a modeler's performance might vary for different phases. For example, while domain knowledge presumably plays an important role in problem understanding, process modeling knowledge and experience will be a relevant factor for modeling phases [7]. To enable a fine-grained analysis of the process of process modeling including its different modeling phases, the automatic detection of all five phases is required.
In [7,8], an algorithm for automatically detecting modeling and reconciliation phases from interactions with the modeling platform has been proposed. This approach exploits the interactions with the modeling platform to differentiate between modeling and reconciliation phases. Longer time periods characterized by an absence of interactions, i.e., problem understanding, method finding, and validation, cannot be differentiated and are labeled as comprehension. A comprehensive analysis of the process of process modeling, however, requires the detection of all five phases and not only those during which interactions with the modeling platform take place.
To overcome the limitations of the existing state of the art [7,8], this paper aims at the development of an automatic approach for phase detection in a single process modeling instance. The central research question of this paper can be formulated as follows: "Is it possible to automatically and accurately detect the different modeling phases of single modeling instances?" To answer this research question we first developed a novel machine learning approach that exploits model interactions along with eye tracking data and slices the overall process modeling instance into a sequence of phases. We then validated our approach in two experiments. The validation of the approach against real data from process modeling sessions yielded promising results, both in terms of quality and computation time, thus demonstrating the feasibility of our technique.
Our novel approach for automatic identification of phases makes a more fine-grained analysis of the process of process modeling possible and makes it possible to take phase specifics into account. An automated detection of such fine-grained phases within process modeling sessions is also a precondition for the development of a context-aware modeling platform that is able to detect the current context (i.e., the modeling phase the modeler is currently engaged in) and support the modeler in a phase-specific manner through recommendations, interventions, or even adaptations of the modeling platform (for example, the optimal tool support for validation is presumably different from tool support for problem understanding or method finding) [9].
The remainder of the paper is structured as follows. Section 2 reports some background information regarding the process of process modeling and the automatic phase detection problem. Section 3 formalizes the contribution of the paper and Section 4 evaluates the performance of the approach. Section 5 concludes the paper and sketches possible future work.

Background and related work
This section introduces the process of process modeling as typically presented in the literature including its five phases: problem understanding, method finding, modeling, reconciliation, and validation (cf. Section 2.1). Moreover, it introduces an existing approach for detecting modeling and reconciliation phases based on model interactions, which will serve as a baseline for the machine learning approach proposed in this paper (cf. Section 2.2). Finally, it discusses existing exploratory research on phase detection considering multi-modal data (i.e., model interactions as well as eye tracking data), which served as a starting point for developing our phase detection approach (cf. Section 2.3).

Process of process modeling
Existing research on the process of process modeling describes the act of creating a process model as a flexible process consisting of five distinct phases which are iteratively performed, i.e., they can be executed repeatedly and can be skipped for some iterations as needed [7].
Problem understanding. To develop a process model, modelers need to understand the problem (i.e., both the requirements and the process model created so far). During problem understanding, modelers build an internal representation (i.e., a mental model) of the problem to be modeled within their working memory [6].
Method finding. In method finding phases, modelers decompose the problem into smaller sub-problems and develop a solution that is independent of the concrete modeling notation. This can involve hierarchically structuring a process model, but also horizontally dividing the problem into sub-problems that can be mapped to workflow patterns [10] (e.g., embedding an activity into a conditional fragment for creating an optional activity).
Modeling. Once modelers have developed a solution for the problem, they can interact with the modeling platform to implement it by creating an external representation of the problem stored in their working memory. For instance, when using BPMN (Business Process Model and Notation) [11], modelers might insert an activity into the model and embed it into a conditional branch using gateways and sequence flows to implement an optional activity.
Reconciliation. Reconciliation phases are concerned with improving the understandability of the process model and facilitating subsequent phases. This includes changes to an activity label for resolving non-intention-revealing naming of activities [12], but also relates to the secondary notation of process models [13,14].
Validation. In validation phases, modelers evaluate the quality of the externalized process model and assess whether the model indeed provides a correct solution to the considered problem. In particular, in line with the SEQUAL framework [15], which is grounded in semiotic theory, modelers might perform checks for identifying syntactic, semantic, and pragmatic quality issues in the process model.
As mentioned previously, the creation of a process model is a flexible process during which the phases described above are iteratively executed. Fig. 1 depicts an example sequence of phases executed for one particular modeling session.

Phase detection based on model interactions
In [7] a naïve approach to automatically detect modeling and reconciliation phases from a log of interactions obtained from the modeling platform Cheetah Experimental Platform (CEP) [16] is proposed. With respect to the modeling platform, the creation of a process model consists of a series of model interactions, e.g., adding activities and edges or moving elements for laying out the process model (cf. Table 1 for an overview of possible interaction types). The naïve approach maps the interactions with the modeling platform to the phases presented in Section 2.1. More specifically, interactions for creating model elements, deleting model elements, reconnecting edges, and adding/deleting edge conditions are classified as modeling actions. Interactions for laying out edges, moving model elements, renaming activities, and updating edge conditions are classified as reconciliation actions. Identified actions are then aggregated to phases using the algorithm proposed in [8]; thresholds are used to avoid very short phases.
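The mapping from interactions to actions can be illustrated with a minimal sketch. The interaction type names, the gap threshold `max_gap_ms`, and the aggregation rule are assumptions made for illustration only: the actual types are those of Table 1, and the aggregation algorithm of [8] is not reproduced here.

```python
# Hypothetical interaction type names; the real ones are listed in Table 1.
MODELING = {"CREATE_NODE", "DELETE_NODE", "CREATE_EDGE", "DELETE_EDGE",
            "RECONNECT_EDGE", "CREATE_CONDITION", "DELETE_CONDITION"}
RECONCILIATION = {"MOVE_NODE", "MOVE_EDGE_BENDPOINT", "RENAME",
                  "UPDATE_CONDITION"}


def classify_action(interaction_type):
    """Map one interaction type to a modeling or reconciliation action."""
    if interaction_type in MODELING:
        return "modeling"
    if interaction_type in RECONCILIATION:
        return "reconciliation"
    raise ValueError(f"unknown interaction type: {interaction_type}")


def aggregate(log, max_gap_ms=30_000):
    """Group a time-sorted log of (timestamp_ms, type) pairs into phases.

    Consecutive actions of the same class are merged; a pause longer than
    max_gap_ms (an assumed threshold) is reported as comprehension.
    Returns a list of (start, end, phase_type) tuples.
    """
    phases = []
    for timestamp, interaction_type in log:
        label = classify_action(interaction_type)
        gap = timestamp - phases[-1][1] if phases else 0
        if phases and phases[-1][2] == label and gap <= max_gap_ms:
            # Same class and no long pause: extend the current phase.
            phases[-1] = (phases[-1][0], timestamp, label)
            continue
        if phases and gap > max_gap_ms:
            # Long pause without interactions: undifferentiated comprehension.
            phases.append((phases[-1][1], timestamp, "comprehension"))
        phases.append((timestamp, timestamp, label))
    return phases


example_log = [(0, "CREATE_NODE"), (1_000, "CREATE_EDGE"),
               (2_000, "MOVE_NODE"), (50_000, "CREATE_NODE")]
example_phases = aggregate(example_log)
```

Note that the pause between 2 s and 50 s is reported only as comprehension; nothing in the interaction log indicates which of the three interaction-free phases it belongs to.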
While modeling and reconciliation phases can be detected by this naïve approach, all time periods characterized by the absence of interactions, i.e., problem understanding, method finding, and validation, cannot be differentiated, but are subsumed as comprehension. Moreover, this assumption makes the technique unreliable in many circumstances, e.g., when a short comprehension phase is surrounded by reconciliation or modeling phases, as documented in [17] (in this case the naïve approach incorporates the comprehension phase into the surrounding phases). Our work overcomes these limitations and presents an approach that can automatically detect all phases.

Phase detection based on multi-modal data
To address the limitations of the naïve approach introduced in Section 2.2, we propose an approach based on multi-modal data. More specifically, we explored the possibility of using eye tracking data (in addition to model interactions) to distinguish between the phases of problem understanding, method finding, and validation (cf. [18]).
Using eye tracking it is possible to identify fixations, i.e., points on the screen modelers focus their attention on. The sequence of fixations of a modeler is denoted as scan-path. Fig. 2 shows an example of a scan-path during a brief session, with fixations represented as black dots. Areas of Interest (AOIs) are a tool that is frequently used for the analysis of eye tracking data. AOIs refer to sub-regions of the stimulus (in our context the modeling platform) [19]. In our case AOIs refer to the textual description of the task, the modeling area, and the toolbox, as represented in Fig. 2. Having AOIs allows the extraction of metrics specific to each AOI (e.g., number of fixations on text) and the analysis of transitions between AOIs [19]. For example, Fig. 2 shows transitions from text to toolbox, from toolbox to model, from model to toolbox, and from toolbox to model.
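As a minimal sketch of how AOI-specific metrics and transitions can be derived, assume that each fixation of a scan-path has already been mapped to the name of the AOI it falls into. The function names and the example scan-path below are illustrative assumptions, chosen so that the resulting transitions match those described for Fig. 2.

```python
from collections import Counter


def aoi_transitions(aoi_sequence, include_self=False):
    """Return (source, target) pairs for consecutive fixations.

    By default only changes of AOI are reported; with include_self=True,
    consecutive fixations within the same AOI are reported as well.
    """
    pairs = zip(aoi_sequence, aoi_sequence[1:])
    return [(src, dst) for src, dst in pairs if include_self or src != dst]


def fixation_counts(aoi_sequence):
    """Number of fixations per AOI."""
    return Counter(aoi_sequence)


# One AOI name per fixation, in temporal order (illustrative data).
scan_path = ["text", "text", "toolbox", "model", "model", "toolbox", "model"]
transitions = aoi_transitions(scan_path)
counts = fixation_counts(scan_path)
```

Here `transitions` contains text to toolbox, toolbox to model, model to toolbox, and toolbox to model, in that order.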
As a first step in the development of a phase detection approach based on multi-modal data, we conducted an exploratory study where we collected data from 116 student modelers and manually analyzed the comprehension phases we obtained from applying the naïve phase detection algorithm outlined in Section 2.2. Our preliminary findings from this exploratory study suggested that the fixation patterns we identified are related to the phases of problem understanding, method finding, and validation and can thus form the basis for automatically detecting the respective phases (cf. [18]). In particular, the way the modelers' eyes transitioned between AOIs (i.e., model, text, toolbox) appeared promising. For example, our exploratory study showed that during problem understanding, the fixations were primarily on the textual description as modelers are building an internal representation of the problem. Similarly, during method finding, fixations were on parts of the textual description, followed by fixations on the modeling platform and the toolbox. Finally, during syntactic validation, fixations showed up on the modeling canvas and occasionally on the toolbox, while during semantic validation, the fixations occurred on the textual description and the modeling canvas with numerous switches between these two AOIs.
To further investigate fixation patterns that could form the basis of an automatic phase detection approach, we developed a visualization tool to represent transitions between AOIs together with the model interactions in an integrated manner [17]. An example of such a visualization is reported in Fig. 3. It shows a problem understanding phase at the beginning with the attention of the user on the text. Afterward, several method finding phases occurred and all share a similar fixation pattern: the modeler first focused on the model or the text, then the focus briefly (and possibly repeatedly) shifted to the toolbox (in order to understand the "tools" that can be used), and to the text (to map the requirements to the available tools). This further highlighted the potential of using eye tracking data as the basis for an automatic phase detection approach.
Additionally, Appendix A gives an intuition of how the multi-modal phase detection approach works in contrast to the state of the art approach: it shows a detailed example of a process modeling session including data on how the modeler interacted with the modeling platform.

Identification of phases using multi-modal data
This section introduces our approach for phase detection based on multi-modal data (i.e., model interactions combined with eye tracking data). We first provide preliminary definitions used in the remainder of the paper (cf. Section 3.1), then report a formal problem definition (cf. Section 3.2), before presenting our approach for phase detection (cf. Sections 3.3 and 3.4).

Preliminary definitions
The multi-modal data we are dealing with consists, for each modality, of an ordered sequence of events. These events can be interactions with the modeling tool or eye tracking data. In both cases, we formally describe these data as sequences. Given a set $S$, a sequence is a function $\sigma : \mathbb{N}^+ \to S$. We say that $\sigma$ maps index values to the corresponding elements in $S$. For simplicity, in the rest of the paper, we will use the string interpretation of sequences: $\sigma = \langle s_1, s_2, \ldots, s_n \rangle$, where $s_i = \sigma(i)$.
In order to describe the basic operations we can do with our input data, we need some manipulation capabilities. Therefore, we assume that typical operators over sets and sequences are available and behave as commonly expected. Given two sequences $\sigma_1 = \langle a_1, \ldots, a_n \rangle$ and $\sigma_2 = \langle b_1, \ldots, b_m \rangle$, the concatenation operator $\cdot$ creates a new sequence as the ordered combination of them: $\sigma_1 \cdot \sigma_2 = \langle a_1, \ldots, a_n, b_1, \ldots, b_m \rangle$. In all our cases, sequence elements contain a time component and we assume all sequences to be sorted temporally, in ascending order. We assume the availability of an operator which adds items to a sequence keeping it ordered. Moreover, we have access to the items of a sequence $\sigma = \langle s_1, s_2, \ldots, s_n \rangle$ using the sequence indexing $\sigma_i$: $\sigma_i = s_i \in S$. Sometimes, the sequence elements are actually tuples, e.g., $\sigma_i = (a, b)$, and to access their single components we assume a projection operator $\pi_a(\sigma_i) = a$ and $\pi_b(\sigma_i) = b$.
A sequence $\sigma$ can be divided into a sequence of (sub-)sequences $\langle \sigma_1, \sigma_2, \ldots, \sigma_k \rangle$, called partition, such that $\sigma_1 \cdot \sigma_2 \cdot \cdots \cdot \sigma_k = \sigma$ (i.e., the partition covers the sequence). Specifically, given a sequence $\sigma$ with $|\sigma| = n$, it is possible to identify $n - 1$ indexes where to cut $\sigma$, as depicted in Fig. 4a. Each cut adds one part to the partition. The power set of the set of possible cuts identifies all possible partitions that can be generated starting from a sequence, as exemplified in Fig. 4b. In this case, the set of possible cuts is $\{1, 2, 3\}$ and its power set is $\{\{\}, \{1\}, \{2\}, \{3\}, \{1, 2\}, \{1, 3\}, \{2, 3\}, \{1, 2, 3\}\}$. Therefore, given a sequence of length $n$, the number of possible partitions is the cardinality of the power set of the set of cuts, which is $2^{n-1}$.
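The correspondence between subsets of cuts and partitions can be checked with a short sketch. The function name `partitions` is an illustrative choice; the enumeration simply walks through the power set of the cut positions.

```python
from itertools import combinations


def partitions(sequence):
    """Enumerate all partitions of a sequence into contiguous sub-sequences.

    Every subset of the n - 1 possible cut positions yields exactly one
    partition, so a non-empty sequence of length n has 2 ** (n - 1) of them.
    """
    n = len(sequence)
    cut_positions = range(1, n)
    result = []
    for number_of_cuts in range(n):
        for cuts in combinations(cut_positions, number_of_cuts):
            bounds = [0, *cuts, n]
            result.append([sequence[a:b] for a, b in zip(bounds, bounds[1:])])
    return result


events = ["e1", "e2", "e3", "e4"]
all_partitions = partitions(events)
```

For the sequence of length 4 used above, the set of cuts is {1, 2, 3} and the enumeration returns 8 partitions. The exponential growth is the reason an exhaustive search over partitions quickly becomes impractical for real modeling sessions.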

Problem formalization
In this paper we address the problem of dividing a modeling session into the phases introduced in Section 2.1 (i.e., finding the partition of the session), considering the user's interactions with the modeling platform as well as eye tracking data.

Definition 3.1 (Interaction).
An interaction $i = (t, \mathit{type})$ with the modeling tool is a pair containing the timestamp $t$ at which the interaction took place and the interaction type $\mathit{type}$, which refers to one of those listed in Table 1.

Definition 3.2 (Log of Interactions).
A log of interactions $L_I = \langle i_1, i_2, \ldots, i_n \rangle$ is a sequence of interactions with the modeling tool, progressively ordered according to their timestamps.
As a second source of input, we consider eye tracking data as detailed in the following, more specifically fixations and transitions.

Definition 3.3 (Fixations).
A fixation represents a period of time in which the gaze was fixed on one point (up to some approximations) [19].

Fixations and transitions of an eye tracking session can be calculated from the raw recordings of the eye tracker and are stored in temporal order in a log of fixations and a log of transitions, respectively.

Definition 3.5 (Log of Fixations).
A log of fixations $L_F = \langle f_1, f_2, \ldots, f_n \rangle$ is a sequence of fixations ordered progressively according to their start timestamp.

Using interactions, fixations, and transitions, our goal is to automatically detect the phases introduced in Section 2.1. We can formalize this problem as finding a sequence of phases from a log of interactions, a log of fixations, and a log of transitions. More specifically, we aim to partition the logs such that each partition corresponds to one phase (where the events belonging to this partition are in line with the phase description outlined in Section 2.1) and adjacent partitions are of different phase types.

Considering the example of
We approach the problem by decomposing it into a hierarchy of sub-problems. The general algorithm describing the overall approach is reported in Algorithm 1. It expects the logs of interactions, transitions and fixations as well as the minimum duration of a phase. Please note that we expect the logs to have synchronized time (i.e., they capture different angles of the same modeling session). In a first step, similarly to the state of the art, the algorithm detects a set of partitions with just high-level phases, i.e., modeling, reconciliation and comprehension phases (cf. line 1, Alg. 1; for details see Section 3.3).

Definition 3.7 (High-Level Phase).
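A high-level phase $p = (t_s, t_e, \mathit{type})$ is a time interval, specified by a start time $t_s$ and an end time $t_e$, with an associated type $\mathit{type}$. The possible types are $\mathit{type} \in \{\mathit{comprehension}, \mathit{modeling}, \mathit{reconciliation}\}$.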

Identification of high-level phases using multi-modal data
In order to extract the high-level phases out of a modeling session, we translate the problem at hand into a series of classification problems. Figs. 5a and 5b highlight the overall idea of our approach for high-level phase detection, which can be divided into 4 steps. In the first step, the algorithm identifies a sliding window (cf. ① in Fig. 5a). In a second step, different features are extracted for the identified sliding window (cf. ② in Fig. 5a). The sliding window is then classified as one of the high-level phases using a pre-trained classifier (cf. ③ in Fig. 5a). This whole procedure is repeated for the entire session, resulting in a sequence of candidate phases. In a subsequent step, adjacent phases with the same phase type are merged (cf. ④ in Fig. 5b). Alg. 2 provides a more formal description of the high-level phase detection and is detailed in the following.
To obtain candidate high-level phases our algorithm uses a sliding window approach, which makes it possible to capture the optimal separation of partitions among events and to obtain maximum flexibility in the selection of the classification algorithms and the features to use. In Alg. 2 the iteration over all windows is reported in line 5. The selection of the sliding window size plays an important role: since each sliding window is classified as a single phase, its length represents the minimum duration of a high-level phase (or partition). Choosing a window size that is too small might lead to phases not being recognized (because of a lack of representative features). If the window size, in turn, is too large, then phases might not be properly separated. For each sliding window, a set of features is extracted (line 6, Alg. 2) and the phase type is detected (line 7, Alg. 2) using a standard classification technique available in machine learning. The classifiers we tested in this paper are: Naïve Bayes [20,21]; Multilayer Perceptron [21,22]; Support Vector Machines (SVMs) [21,23] (with ANOVA, i.e., ANalysis Of VAriance, kernel [24]); Random Forests and Extra Trees [21,23,25]. At the end of each iteration, the algorithm appends the new phase to the sequence of phases (line 8). This procedure is repeated until the end of the modeling session is reached (line 5). Once the algorithm for high-level phase detection has provided a classification for the entire modeling session, there can be contiguous phases that were classified with the same type; these are merged (line 10 in Alg. 2).
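The loop described above can be sketched as follows. This is not Alg. 2 itself: it assumes consecutive, non-overlapping windows of fixed size, and the parameters `extract` (feature extraction) and `classify` (a pre-trained classifier) are placeholders supplied by the caller.

```python
def merge_adjacent(phases):
    """Merge contiguous (start, end, type) phases that share the same type."""
    merged = []
    for start, end, phase_type in phases:
        if merged and merged[-1][2] == phase_type:
            merged[-1] = (merged[-1][0], end, phase_type)
        else:
            merged.append((start, end, phase_type))
    return merged


def detect_high_level_phases(session_start, session_end, window_size,
                             extract, classify):
    """Classify each window of the session, then merge equal neighbors.

    extract(start, end) returns the feature vector of a window and
    classify(features) returns one of the high-level phase types.
    """
    candidates = []
    start = session_start
    while start < session_end:
        end = min(start + window_size, session_end)
        candidates.append((start, end, classify(extract(start, end))))
        start = end
    return merge_adjacent(candidates)


# Toy stand-ins for the feature extractor and the classifier.
toy_extract = lambda start, end: start
toy_classify = lambda f: "modeling" if f < 10_000 else "comprehension"
detected = detect_high_level_phases(0, 20_000, 5_000, toy_extract, toy_classify)
```

With a window of 5000 ms the toy session is cut into four windows, which are merged into one modeling phase followed by one comprehension phase. No detected phase can be shorter than the window size, which is the trade-off discussed above.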
In order to identify the features to use for the classification (cf. line 6 in Alg. 2), we went through a feature engineering process. Data referring to the user's interactions with the modeling tool represents the primary source of information. For example, the creation of a new task clearly represents a modeling phase, while the moving of an edge is part of a reconciliation phase. Eye tracking data (i.e., fixations and transitions between areas of interest) is also considered since it makes it possible to refine the classification and in particular can help to identify short comprehension phases (cf. [17]). We selected features to capture all these notions and we grouped them into classes. Given a log of interactions $L_I$, a log of transitions $L_T$, a log of fixations $L_F$ and a time window (i.e., a start time $t_s$ and an end time $t_e$), the first feature class refers to features describing the user's interactions with the modeling tool. The primary purpose of these features is to capture the high-level phases by analyzing how the user performed the actual modeling:
• The number of interactions $\#(\mathit{type})$ of a particular type $\mathit{type}$ taking place within the time frame of the sliding window: Interactions with the modeling tool can refer to different interaction types (e.g., Create Node, Move Node). Specifically, as reported in Table 1, 19 different interaction types can be differentiated. This class of features counts the number of interactions of a particular type within a particular sliding window, resulting in 19 features. Additionally, interaction types can be classified either as modeling interactions (i.e., types 1-12 in Table 1) or reconciliation interactions (i.e., types 13-19 in Table 1). Thus, we consider the number of modeling interactions and the number of reconciliation interactions as two additional features. This set of features is useful to provide information about the actual operations performed during the given sliding window.
• Time distance to the closest interactions of a particular type, both occurring before and after the current sliding window: This class of features considers the time of the closest interactions of a particular type before and after the sliding window, resulting in 38 features, i.e., 19 for before and 19 for after. Moreover, we consider 4 additional features capturing the time of the closest modeling interaction (i.e., of types 1-12 in Table 1) and reconciliation interaction (i.e., of types 13-19 in Table 1) before and after the sliding window. This set of features is useful to characterize the "neighborhood" of the sliding window under examination and therefore to improve the classification of borderline cases (e.g., windows with no interactions, but just before a dense cluster of modeling interactions, which suggests a modeling phase about to start).
The second group of features refers to the information coming from eye tracking:
• The sum of the durations of all fixations in one area of interest: This measure is computed for each area of interest. Since we have 3 areas of interest (cf. Section 2.3), we have 3 additional features. Depending on the phase, different areas of interest might be in focus. For example, during modeling phases, we expect most of the time to be spent on the model canvas whereas, during comprehension phases, we expect most of the time to be spent on the text and on the toolbox. This set of features is relevant in order to describe how much time the user spent looking at the different areas of interest, to reinforce the phase classification and therefore to help in disambiguating borderline cases [18].
• The number of transitions from one area of interest to another: This measure is computed for each combination of transitions. As we have 3 areas of interest and it is possible to have transitions within the same area of interest, we introduce $3^2 = 9$ additional features. As for the previous group of features, this set of features is relevant in order to disambiguate borderline situations. An example of such a case is reported in [17], where a phase was classified as reconciliation even though the number of transitions clearly showed a comprehension pattern (flipping between text and model).
One additional feature is included: the number of features with value set to 0. This feature can be seen as a way of quantifying the absence of interactions. Considering all the features described, for each sliding window a feature vector with 76 numerical components (21 interaction counts, 42 time distances, 3 fixation durations, 9 transition counts, and the number of zero-valued features) is extracted from the logs.
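The feature classes listed above can be assembled into one vector as in the following sketch. Several details are assumptions made for illustration: interaction types are encoded as the integers 1 to 19 of Table 1, fixations are `(start, duration, aoi)` tuples, transitions are `(timestamp, source, target)` tuples, an event belongs to a window if its timestamp lies in `[start, end)`, and a missing neighbor is encoded with the sentinel `MISSING`.

```python
from itertools import product

TYPES = range(1, 20)                # the 19 interaction types of Table 1
MODELING = set(range(1, 13))        # types 1-12
RECONCILIATION = set(range(13, 20)) # types 13-19
AOIS = ("text", "model", "toolbox")
MISSING = -1.0                      # assumed sentinel: no such interaction


def closest(times, start, end):
    """Time distances to the closest event before and after the window."""
    before = [start - t for t in times if t < start]
    after = [t - end for t in times if t >= end]
    return [min(before, default=MISSING), min(after, default=MISSING)]


def extract_features(interactions, transitions, fixations, start, end):
    inside = [kind for t, kind in interactions if start <= t < end]
    # 19 counts per type, plus 2 counts per group of types.
    features = [float(inside.count(kind)) for kind in TYPES]
    for group in (MODELING, RECONCILIATION):
        features.append(float(sum(kind in group for kind in inside)))
    # 38 distances per type, plus 4 distances per group of types.
    for kind in TYPES:
        times = [t for t, k in interactions if k == kind]
        features += closest(times, start, end)
    for group in (MODELING, RECONCILIATION):
        times = [t for t, k in interactions if k in group]
        features += closest(times, start, end)
    # 3 summed fixation durations, one per AOI.
    for aoi in AOIS:
        features.append(float(sum(d for t, d, a in fixations
                                  if a == aoi and start <= t < end)))
    # 9 transition counts, one per (source, target) combination.
    for pair in product(AOIS, repeat=2):
        features.append(float(sum(1 for t, s, d in transitions
                                  if (s, d) == pair and start <= t < end)))
    # 1 feature counting how many of the previous features are zero.
    features.append(float(sum(1 for value in features if value == 0)))
    return features


vector = extract_features(
    interactions=[(1_000, 1), (6_000, 13), (12_000, 1)],
    transitions=[(5_600, "text", "model")],
    fixations=[(5_200, 300, "text"), (5_600, 200, "model")],
    start=5_000, end=10_000)
```

The resulting list has 19 + 2 + 38 + 4 + 3 + 9 + 1 = 76 components, in the order in which the feature classes are introduced in the text.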

Identification of low-level phases
As a result of the high-level phase detection algorithm, we obtain a sequence of phases classified as either modeling, reconciliation, or comprehension. In this section, we introduce our approach for low-level phase detection, which further refines comprehension phases into problem understanding, method finding and validation. Since, by definition, comprehension phases are characterized by no interaction with the modeling tool, to detect low-level phases we can rely only on the eye tracking data and, in particular, we focused on the log of transitions. To detect the low-level phases we devised two approaches, graphically depicted in Fig. 6b. We rephrased our problem as sequence labeling and applied either Hidden Markov Models (HMMs) [23,26-28] or Conditional Random Fields (CRFs) [29,30] to solve it. The low-level phase identification approach starts by filtering the log of transitions for only those that occurred within the comprehension phase under examination (cf. Fig. 6a). Each sub-sequence of transitions is converted into a sequence of features (cf. ① in Fig. 6b) and a sequence classifier attaches a possible low-level phase type to each transition (cf. ② in Fig. 6b). Finally, these points are merged into proper phases (cf. ③ in Fig. 6b). A formalization of this approach is reported in Algorithm 3: given a log of transitions (line 1, Alg. 3), it first converts it into a sequence of feature vectors (line 2, Alg. 3). The sequence of features is fed to a sequence labeler which associates each element with a class. In our case, the labels correspond to the low-level phases we are interested in detecting (line 3, Alg. 3). Such a sequence of labels is post-processed by merging equivalent contiguous labels, thus creating time intervals describing the actual low-level phases (line 4, Alg. 3).
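The steps of this refinement can be sketched as follows. The sketch is not Algorithm 3 itself: the sequence labeler (HMM or CRF) is passed in as the placeholder `label_sequence`, transitions are assumed to be `(timestamp, source, target)` tuples, and the rule that a low-level phase lasts until the first transition with a different label is an assumption about how labels are turned into intervals.

```python
def merge_labels(timestamps, labels, start, end):
    """Turn per-transition labels into (start, end, type) intervals.

    Contiguous equal labels are merged; each interval is assumed to last
    until the first transition carrying a different label.
    """
    phases = []
    for timestamp, label in zip(timestamps, labels):
        if phases and phases[-1][2] == label:
            continue
        boundary = timestamp if phases else start
        if phases:
            # Close the previous interval at the new boundary.
            phases[-1] = (phases[-1][0], boundary, phases[-1][2])
        phases.append((boundary, end, label))
    return phases


def refine_comprehension(phase, transitions, featurize, label_sequence):
    """Refine one comprehension phase into low-level phases."""
    start, end, _ = phase
    inside = [tr for tr in transitions if start <= tr[0] < end]
    if not inside:
        return [phase]  # nothing to label: keep the phase unrefined
    labels = label_sequence(featurize(inside))
    return merge_labels([tr[0] for tr in inside], labels, start, end)


# Toy stand-ins: identity features and a fixed labeling.
toy_transitions = [(5, "text", "model"), (10, "model", "text"),
                   (20, "text", "toolbox"), (30, "toolbox", "model"),
                   (50, "model", "text")]
toy_labeler = lambda features: ["problem understanding",
                                "problem understanding",
                                "method finding", "validation"]
low_level = refine_comprehension((0, 40, "comprehension"), toy_transitions,
                                 lambda trs: trs, toy_labeler)
```

The transition at time 50 lies outside the comprehension phase and is filtered out before labeling; the remaining four labels collapse into three low-level phases covering the interval from 0 to 40.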
As in the previous section, we went through a feature engineering process for the low-level phase identification as well. The set of features includes the source and the target AOIs of the current transition as well as the source and the target AOIs of the 2 previous and the 2 following transitions. Additionally, we added 2 features to indicate whether the given transition is the first or the last within the comprehension phase.
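A minimal sketch of this conversion is given below. The feature names, the dictionary representation (a common input format for CRF toolkits), and the padding value `PAD` used when a neighboring transition does not exist are assumptions made for illustration.

```python
PAD = "NONE"  # assumed placeholder for neighbors outside the phase


def transition_features(transitions):
    """Convert (source_aoi, target_aoi) transitions into feature dicts.

    Each dict holds the source and target AOIs of the current transition,
    of the 2 previous and of the 2 following transitions, plus two flags
    marking the first and the last transition of the comprehension phase.
    """
    n = len(transitions)
    rows = []
    for i in range(n):
        row = {"is_first": i == 0, "is_last": i == n - 1}
        for offset in range(-2, 3):
            j = i + offset
            source, target = transitions[j] if 0 <= j < n else (PAD, PAD)
            row[f"source[{offset:+d}]"] = source
            row[f"target[{offset:+d}]"] = target
        rows.append(row)
    return rows


rows = transition_features([("text", "model"),
                            ("model", "toolbox"),
                            ("toolbox", "text")])
```

Each transition is thus described by 12 features: 10 AOI names from the context window and the 2 position flags.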

Experimental evaluation
This section describes the evaluation of the algorithms for the classification task of the high- and low-level phases. With this evaluation we want to answer the following two research questions:
Q1 How well does the machine learning approach classify high-level phases in comparison with the state of the art?
Q2 How well does the machine learning approach segment a modeling session into low-level phases?

Data-set and ground truth
The data set used for our evaluation stems from a modeling session experiment where we collected model interactions and eye tracking data from novice modelers. This experiment took place in 2015 at the University of Innsbruck (Austria); 116 psychology students (participation was voluntary) were asked to model a mortgage process after receiving proper training. The log of interactions was directly obtained from CEP after the modeling session (cf. Section 2.2); eye tracking data was collected using a Tobii TX300 eye tracker and the log of fixations (i.e., time-series of fixations including coordinates) was obtained with the help of Tobii Studio. The fixations were mapped onto the AOIs (cf. Fig. 2) using the coordinates of the fixation provided by Tobii Studio, thus obtaining a representation as described in Definition 3.5. The log of transitions was computed based on the log of fixations.
To extract our ground truth knowledge, i.e., the "gold standard" for our dataset, we started from video recordings of different subjects. For this experiment we use data generated by 5 subjects randomly selected among the participants. The video recordings were manually classified by two experts independently, and discrepancies between the two classifications were resolved by a consensus building process. For the manual labeling, we used the descriptions of the phases introduced in Section 2.1, i.e., problem understanding, method finding, modeling, reconciliation, and validation. The manually enriched data set represents the gold standard used for computing the qualitative performance of our approaches. Table 2 depicts the distribution of high-level phases and the distribution of low-level phases among subjects.
Please note that in order to answer the two research questions we will use only parts of the dataset. Specifically, to answer Q1 we will consider only the high-level phases, whereas to answer Q2 we will operate on the low-level phases.
To answer research question Q1, we use the state of the art automated phase detection approach by [7, Sec. 6.3.1] introduced in Section 2.2, which is based on user interactions only, and compare its results with those obtained by the approach presented in this paper. For the actual comparison, we adopted the accuracy measure, i.e., the ratio of correctly classified phases over all phases.
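The accuracy measure can be written down in a few lines. The sketch assumes that predicted and gold-standard labels are already aligned item by item (for example, one label per window); how the two segmentations are aligned is not specified here and is left to the caller.

```python
def accuracy(predicted, gold):
    """Ratio of correctly classified items over all items."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold labels must be aligned")
    if not gold:
        raise ValueError("at least one labeled item is required")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


score = accuracy(
    ["modeling", "modeling", "comprehension", "reconciliation"],
    ["modeling", "reconciliation", "comprehension", "reconciliation"])
```

In this toy example three of the four labels agree with the gold standard.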
In order to extensively analyze the possible factors affecting the quality scores, we tested the different classifiers varying the window size and the set of features used. Fig. 7 depicts the corresponding accuracy values. Moreover, the measures for the state of the art approach are depicted for reference. As a further reference, No Information Rate (NIR) curves are reported, indicating the accuracy of selecting a random class. Our results show that the Extra Trees classifier achieves the highest accuracy with a value of 86% when a window of 5000 ms is used and both eye tracking and interaction features are exploited. Four of the tested techniques have configurations that outperform the state of the art, and only Naïve Bayes performs very poorly.
Besides accuracy, we also evaluate the time needed for training and prediction. For both the training and prediction times, we used wall-clock time as the measure, i.e., the difference between the time at which the task finished and the time at which it started. We repeat the training procedure 5 times and calculate the average. Additionally, we measure the wall-clock time needed by the classifiers to predict the phases of a subject not used for training. We repeat this procedure 5 times and compute the average wall-clock time. Fig. 8 shows the time performance for training the classifiers and for predicting the phases (Naïve Bayes is omitted because of its poor accuracy). With increasing window sizes, the time needed for training and prediction decreases, since it becomes easier to classify the time window (due to the presence of more observations).
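The measurement procedure described above can be sketched as follows. This is a minimal illustration, assuming that the task to be timed (training or prediction) is wrapped in a callable. The function `average_wall_clock` and the placeholder workload `dummy_training` are hypothetical names, not part of the original implementation.

```python
import time
from statistics import mean


def average_wall_clock(task, repetitions=5):
    """Run `task` several times and return the mean elapsed wall-clock time in seconds."""
    durations = []
    for _ in range(repetitions):
        started = time.perf_counter()
        task()
        finished = time.perf_counter()
        # Wall-clock time: finish time minus start time.
        durations.append(finished - started)
    return mean(durations)


def dummy_training():
    # Hypothetical stand-in for fitting a classifier.
    sum(i * i for i in range(10_000))


elapsed = average_wall_clock(dummy_training)
print(f"average over 5 runs: {elapsed:.6f} s")
```

Averaging over several repetitions reduces the influence of background load on a single measurement.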
In conclusion, our results showed that question Q1 can be answered positively, i.e., the new approach outperforms the state of the art approach.

Question Q2: Performance of segmenting into low-level phases
In contrast to the high-level phase classification, the low-level phase classification cannot be performed with the state-of-the-art approach, since it operates only during comprehension phases (i.e., when no interaction takes place). Therefore, to define a baseline, we devised a "brute-force" algorithm called Markovian, which tries all possible partitions and selects the most likely one, assuming the Markov property (i.e., the future state depends only on the current one), as described in Appendix B.
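The idea behind such a brute-force baseline can be sketched as follows. This is a simplified illustration under several assumptions, not the algorithm of Appendix B: the input is reduced to a sequence of area-of-interest symbols, each phase type is represented by a first-order Markov chain given as a pair of initial and transition probabilities, and a partition is scored by summing the log-likelihoods of its sub-sequences. The area names (`text`, `model`), the phase name `validation`, and all probabilities are invented for the example.

```python
import math
from itertools import combinations


def partitions(seq):
    """Yield every split of `seq` into contiguous, non-empty sub-sequences."""
    n = len(seq)
    for k in range(n):  # k = number of cut positions used
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def log_likelihood(sub, chain):
    """Log-probability of `sub` under a first-order Markov chain."""
    initial, transition = chain
    score = math.log(initial[sub[0]])
    for current, following in zip(sub, sub[1:]):
        score += math.log(transition[current][following])
    return score


def best_partition(seq, chains):
    """Try all partitions and keep the one with the highest total score."""
    best_score, best_labels = -math.inf, []
    for partition in partitions(seq):
        score, labels = 0.0, []
        for sub in partition:
            # Most likely chain for this sub-sequence and its phase type.
            sub_score, phase = max(
                (log_likelihood(sub, chain), name) for name, chain in chains.items()
            )
            score += sub_score
            labels.append((phase, len(sub)))
        if score > best_score:
            best_score, best_labels = score, labels
    return best_score, best_labels


# Hypothetical chains: (initial probabilities, transition probabilities).
chains = {
    "problem understanding": (
        {"text": 0.8, "model": 0.2},
        {"text": {"text": 0.9, "model": 0.1}, "model": {"text": 0.5, "model": 0.5}},
    ),
    "validation": (
        {"text": 0.2, "model": 0.8},
        {"text": {"text": 0.5, "model": 0.5}, "model": {"text": 0.1, "model": 0.9}},
    ),
}
seq = ["text", "text", "text", "model", "model", "model"]
score, labels = best_partition(seq, chains)
print(labels)
```

With these invented probabilities the best partition cuts the sequence once, between the third and fourth element, because every additional cut replaces a likely self-transition with a less likely initial state.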
For evaluation, we use the accuracy and the wall-clock time as measurements on the comprehension phases of the ground truth (for the same subjects evaluated in the previous section). Fig. 9 depicts the accuracy values for the detection of low-level phases. The highest accuracy is achieved by CRF with 83%, whereas the baseline reaches 76%.
For evaluating the time performance, we trained on the data of 4 subjects and used the fifth for the prediction. The training and prediction procedure is repeated 5 times, and the average is taken. Table 3 depicts the results. The prediction for the Markovian (baseline) approach is very slow, because it employs a brute-force method. CRF represents a very good trade-off: even though it is not as fast as HMM, both the training and the prediction times are very good.
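The cost of the brute-force baseline can be made explicit with a short count. Assuming that a partition splits a sequence of $n$ transitions into contiguous, non-empty sub-sequences, each of the $n-1$ positions between consecutive transitions is either a cut or not:

```latex
\[
  \#\,\text{partitions}(n) \;=\; \sum_{k=0}^{n-1} \binom{n-1}{k} \;=\; 2^{\,n-1}
\]
```

Under this assumption, a window with only 20 transitions already yields $2^{19} = 524{,}288$ candidate partitions to score, which is consistent with the slow prediction reported for the baseline.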
In conclusion, we can answer question Q2 by showing the performance (both in terms of time requirements and quality) of the different techniques we devised. The CRF approach achieves the best accuracy while showing good time performance.

Limitations
The main limitation of the approaches presented stems from the assumption that all phase types can be induced by interactions with the modeling platform and eye movements on the screen. While this assumption is realistic in some settings (e.g., teaching exercises), it might not hold in the general case. However, eye tracking technologies are becoming increasingly powerful, enabling the mapping of eye tracking data onto real-world objects as well. Thus, it is realistic to assume that our approach can be applied in more dynamic settings in the future. A technical limitation of the approaches is the potential impact of the window size on the identification of high-level phases: a bad decision in this regard can lead to poor results (also for low-level phases).

Fig. 7. Accuracy for detection of high-level phases with different approaches and different window sizes. The legend is shared by all charts.

Finally, it is important to mention that the manual labeling of the gold standard represents a limitation regarding the generalizability of our experimental results.Such labeling might be affected by human perception since it is a manual process.
To mitigate this problem we asked two experts to independently label the instances and reach consensus.

Conclusion and future work
The process underlying the creation of process models has recently received a lot of attention. Specifically, the importance of process models for communication and documentation purposes drives the need to improve their correctness and understandability.
Historically, the analysis of the process of process modeling relied solely on recordings of users' interactions with the modeling tools. Exploiting this data source, the literature reports techniques to gather information regarding the different modeling phases taking place (i.e., comprehension, modeling, reconciliation). With the contribution presented in this paper, we showed how to process the data in a completely new fashion, also incorporating a new source of information, i.e., eye tracking, resulting in a twofold improvement of the state of the art. On the one hand, we are now able to extract phase types that were not identifiable with state-of-the-art techniques (i.e., problem understanding, method finding, semantic and syntactic validation), thus delivering additional useful information to the analysis. On the other hand, the new analysis technique and the exploitation of the new data source resulted in superior accuracy in identifying the phase types with respect to the state of the art.
The technique reported in this paper allows the automatic detection of the different phases a user goes through during a modeling task. The automatic phase detection will allow a more fine-grained analysis of modeling sessions. Moreover, it can be used to provide a better modeling experience through phase-specific modeling support which, in turn, will result in better process models. In particular, the automatic phase detection can provide important contextual information for identifying how to best support the modeler in her specific situation through feedback, interventions, or adaptations. This represents the main long-term goal and direction for future work. At the same time, we plan to investigate different possibilities to improve the overall quality of the phase detection, as well as the introduction of new techniques, resulting in even more accurate inferences.

Appendix C. Implementation details
The approach presented in this paper has been implemented using different programming languages and techniques. The high-level phase classification is implemented entirely in Python using the open-source SciPy ecosystem. We extensively used the scikit-learn package [32] for training and validating the different classifiers.
For the low-level phase classification, Matlab was used for the HMM. The R programming environment was adopted to program the Markov Chain algorithm, and Python with the sklearn-crfsuite package (which is based on CRFsuite [33]) was used for CRF.

Fig. 1. Example sequence of phases of a process modeling session.

Fig. 3. Combined visualization of eye tracking information and model interactions.

Fig. 4. Indexes of all possible cuts of a sequence and the set of all possible partitions for the same sequence.

Considering the example of Fig. A.10b, it is possible to identify the following log of fixations: $\langle (t_s^1, t_e^1, a^1), (t_s^2, t_e^2, a^2), (t_s^3, t_e^3, a^3), (t_s^4, t_e^4, a^4) \rangle$, where $a^i$ is the area of interest of the $i$-th fixation.

Definition 3.6 (Log of Transitions). A log of transitions $L_T = \langle t_1, t_2, \dots, t_n \rangle$ is a sequence of transitions ordered by their start timestamp. Considering the example of Fig. A.10b, it is possible to identify the corresponding log of transitions.

Considering the example of Fig. A.10c, the result of the first line of the algorithm is the following sequence of high-level phases (with respect to the figure, phases 1 and 2 are merged into a single comprehension phase): $\langle (t_s^1, t_e^1, \mathit{type}^1), (t_s^2, t_e^2, \mathit{type}^2), (t_s^3, t_e^3, \mathit{type}^3) \rangle$. In a second step, the algorithm selects only the comprehension phases (line 2, Alg. 1) and replaces each of them with a sequence of low-level phases (i.e., problem understanding, method finding, syntactic and semantic validation) by applying one of the low-level phase detection algorithms (line 4, Alg. 1; for details see Section 3.4).

Definition 3.8 (Low-Level Phase). A low-level phase $p = (t_s, t_e, \mathit{type})$ is a time interval, specified by a start time $t_s$ and an end time $t_e$, with an associated type $\mathit{type}$ that takes one of five possible values. Considering again the example in Fig. A.10c, the final result of the algorithm is the sequence of phases obtained by this replacement.

Algorithm 3: Algorithm to detect low-level phases with HMM or CRF
Input: L_T: log of transitions
       start: start time of window
       end: end time of window
Output: sequence of low-level phases discovered
  ⊳ Filtering of transitions, to consider just the given time interval
1 T ← ⟨(t_s, t_e, aoi_s, aoi_t) ∈ L_T | t_s > start ∧ t_e ≤ end⟩
2 F ← convert T into a sequence of features    ⊳ See Section 3.4
3 P ← SequenceLabeler(F)    ⊳ Standard HMM or CRF procedure
  ⊳ In P we have a sequence (same length as F) where each component is classified as a low-level phase
4 Merge contiguous components of P with the same classification
5 return P
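The steps of Algorithm 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the original implementation: the sequence labeler is passed in as a plain function (a trained HMM or CRF would take its place), the feature representation is a simple tuple, and the names `Transition`, `low_level_phases`, and `toy_labeler`, as well as the area-of-interest names, are hypothetical.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    start: float      # time the gaze leaves the source area
    end: float        # time the gaze reaches the target area
    aoi_source: str
    aoi_target: str


def low_level_phases(log, start, end, labeler):
    """Filter, convert to features, label, and merge, following the steps of Algorithm 3."""
    # Step 1: keep only the transitions inside the given time interval.
    window = [t for t in log if t.start > start and t.end <= end]
    if not window:
        return []
    # Step 2: convert transitions into features (assumed representation).
    features = [(t.aoi_source, t.aoi_target, t.end - t.start) for t in window]
    # Step 3: one label per feature vector.
    labels = labeler(features)
    # Step 4: merge contiguous components with the same classification.
    phases = []
    for t, label in zip(window, labels):
        if phases and phases[-1][2] == label:
            phases[-1] = (phases[-1][0], t.end, label)
        else:
            phases.append((t.start, t.end, label))
    return phases


def toy_labeler(features):
    # Hypothetical stand-in for a trained HMM or CRF.
    return ["problem understanding" if "text" in (src, tgt) else "method finding"
            for src, tgt, _ in features]


log = [
    Transition(1.0, 2.0, "model", "text"),
    Transition(3.0, 4.0, "text", "model"),
    Transition(5.0, 6.0, "model", "toolbox"),
    Transition(9.0, 10.0, "toolbox", "model"),  # outside the window below
]
result = low_level_phases(log, start=0.0, end=8.0, labeler=toy_labeler)
print(result)
```

Passing the labeler as a parameter keeps the filtering and merging logic independent of whether an HMM or a CRF produces the labels.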

Fig. 6. Different steps involved in the identification of low-level phases using HMM or CRF.

Fig. 8. Wall-clock time for training and prediction with different window sizes.

Fig. 9. Comparison of the accuracy for low-level classification for the different techniques devised as well as for the baseline (Markovian).

Fig. A.10. Example of a process modeling session with model interaction and eye tracking data as well as the inferred phases.

Algorithm 4: Algorithm to detect low-level phases with Markov Chains
Input: L_T: log of transitions
       start: start time of window
       end: end time of window
Output: sequence of low-level phases discovered
  ⊳ Filtering of transitions, to consider just the given time interval
1 T ← ⟨(t_s, t_e, aoi_s, aoi_t) ∈ L_T | t_s > start ∧ t_e ≤ end⟩
  ⊳ Define the best score and the best sequence
2 score_best ← −∞
3 seq_best ← ⟨⟩
4 Π ← generate all possible partitions of the sequence T    ⊳ See Section 3
5 foreach partition π ∈ Π do
6     score ← 0
7     seq ← ⟨⟩
      ⊳ Iterate through all sub-sequences of the current partition to check its quality

Table 1
List of possible interaction types.
Formally, in this work, a fixation $f = (t_s, t_e, a)$ is a tuple where $t_s$ indicates the time the fixation started, $t_e$ the time the fixation ended, and $a$ the area of interest of the fixation, with $a$ being one of the three areas of interest described in Section 2.3.

Definition 3.4 (Transition). In our context, a transition represents the gaze movement from one area of interest to another one [19]. Formally, a transition $t = (t_s, t_e, \mathit{aoi}_s, \mathit{aoi}_t)$ is a tuple reporting the beginning time of the transition $t_s$ and its end time $t_e$, as well as the starting area of interest $\mathit{aoi}_s$ and the target area of interest $\mathit{aoi}_t$. The possible values for the areas of interest are the three reported in Section 2.3.
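The two tuples defined above map naturally onto small record types. The sketch below also shows one possible way to derive transitions from a log of fixations. It assumes that a transition starts when one fixation ends and ends when the next fixation, located in a different area of interest, begins. The text does not state this rule, and the names `Fixation`, `Transition`, `transitions_from_fixations`, and the area names are hypothetical.

```python
from typing import NamedTuple


class Fixation(NamedTuple):
    start: float   # time the fixation started (ms)
    end: float     # time the fixation ended (ms)
    aoi: str       # area of interest of the fixation


class Transition(NamedTuple):
    start: float
    end: float
    aoi_source: str   # starting area of interest
    aoi_target: str   # target area of interest


def transitions_from_fixations(fixations):
    """Derive gaze transitions between consecutive fixations in different areas (assumed rule)."""
    ordered = sorted(fixations, key=lambda f: f.start)
    result = []
    for previous, current in zip(ordered, ordered[1:]):
        if previous.aoi != current.aoi:
            result.append(
                Transition(previous.end, current.start, previous.aoi, current.aoi)
            )
    return result


fixations = [
    Fixation(0, 200, "text"),
    Fixation(250, 400, "text"),
    Fixation(450, 700, "model"),
    Fixation(760, 900, "text"),
]
transitions = transitions_from_fixations(fixations)
print(transitions)
```

Consecutive fixations within the same area produce no transition, so the resulting log is typically much shorter than the log of fixations.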

Algorithm 2: Algorithm to detect high-level phases (HighLevelPhases)
   ...
   ⊳ Construct all possible sliding windows and classify each of them as proper phase
5  foreach window (sw_s, sw_e) in between start and end do
6      F ← ExtractFeatures(L_I, L_F, L_T, sw_s, sw_e)
   ...
   ⊳ In P there can be contiguous phases with the same type
10 Merge phases of P that are contiguous and with the same type
11 return P
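The window-based procedure of Algorithm 2 can be illustrated with the following sketch. It makes several simplifying assumptions: windows are consecutive and non-overlapping, events are `(timestamp_ms, kind)` pairs, and the classifier is a plain function standing in for a trained model such as Extra Trees. The names `high_level_phases` and `toy_classifier` are hypothetical.

```python
def high_level_phases(events, start, end, window_size, classify):
    """Classify consecutive time windows and merge contiguous windows of the same type."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    phases = []
    window_start = start
    while window_start < end:
        window_end = min(window_start + window_size, end)
        # Feature extraction is reduced here to selecting the events in the window.
        in_window = [e for e in events if window_start <= e[0] < window_end]
        label = classify(in_window)
        if phases and phases[-1][2] == label:
            # Contiguous phase with the same type: extend it.
            phases[-1] = (phases[-1][0], window_end, label)
        else:
            phases.append((window_start, window_end, label))
        window_start = window_end
    return phases


def toy_classifier(window_events):
    # Hypothetical rule: any interaction in the window means "modeling".
    kinds = {kind for _, kind in window_events}
    return "modeling" if "interaction" in kinds else "comprehension"


events = [
    (500, "fixation"), (4000, "fixation"),
    (6000, "interaction"), (7000, "interaction"),
    (12000, "interaction"), (16000, "fixation"),
]
result = high_level_phases(events, 0, 20000, 5000, toy_classifier)
print(result)
```

The choice of `window_size` directly determines the granularity of the detected phases, which is why the evaluation varies it systematically.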

Table 2
Absolute frequency of phase types in the ground truth and duration of modeling sessions for each subject.

Table 3
Time needed for training and prediction of low-level classifiers.

Algorithm 4 (continued)
8      foreach sequence s ∈ π do
9          score_s, type ← highest likelihood that s is generated by one of the Markov Chains, and the phase type the Markov Chain refers to
10         score ← score + score_s
11         seq ← seq ⋅ ⟨(t_s(s_0), t_e(s_|s|), type)⟩
12     end
       ⊳ If the current partition is the best so far, update the global values
13     if score > score_best then
14         score_best ← score
15         seq_best ← seq
16     end
17 end
18 return seq_best