Process Query Language: Design, Implementation, and Evaluation

Organizations can benefit from the use of practices, techniques, and tools from the area of business process management. Through the focus on processes, they create process models that require management, including support for versioning, refactoring and querying. Querying thus far has primarily focused on structural properties of models rather than on exploiting behavioral properties capturing aspects of model execution. While the latter is more challenging, it is also more effective, especially when models are used for auditing or process automation. The focus of this paper is to overcome the challenges associated with behavioral querying of process models in order to unlock its benefits. The first challenge concerns determining decidability of the building blocks of the query language, which are the possible behavioral relations between process tasks. The second challenge concerns achieving acceptable performance of query evaluation. The evaluation of a query may require expensive checks in all process models, of which there may be thousands. In light of these challenges, this paper proposes a special-purpose programming language, namely Process Query Language (PQL) for behavioral querying of process model collections. The language relies on a set of behavioral predicates between process tasks, whose usefulness has been empirically evaluated with a pool of process model stakeholders. This study resulted in a selection of the predicates to be implemented in PQL, whose decidability has also been formally proven. The computational performance of the language has been extensively evaluated through a set of experiments against two large process model collections.


INTRODUCTION
Through the application of methods and techniques from the field of business process management, organisations can identify, model, analyse, deploy, and diagnose their business processes. This process-oriented thinking provides great benefits, as making processes explicit through formal high-level representations, i.e. process models, allows them to subject these processes to various forms of analysis, to use them as the basis for automated support, and to adapt them more easily as well as more rapidly to continual changes imposed by the organisation's environment, both internal and external. As a consequence, some organisations have collected large numbers of process models and their maintenance poses significant challenges.
Process models tend to evolve over time, new models may emerge, which may need to be informed by existing models, and models may need to be merged. To support these activities, it should be possible to query a potentially large process model repository to retrieve models with certain characteristics. Current process model query languages, of which perhaps the most prominent representative is BPMN-Q, predominantly focus on syntactic aspects of process models. Hence queries are based on paths, which are formed by direct succession relations between model elements.
More powerful than a syntactic approach to querying would be an approach based on semantic relations between tasks, e.g. relations that capture that certain tasks need to be executed in order or in parallel or can never be executed as part of the same process instance. Semantic relations are essential when one needs to explore whether certain process behaviours are possible or not.
The added retrieval power of a semantic query language comes at a price. Semantic relations cover a broad spectrum of inter-task dependencies and may be expressed through temporal logic. Temporal logic is a form of modal logic that includes temporal operators and which can be used to reason over the behaviour of systems over time. Temporal logic is powerful enough to be able to express properties that are undecidable [Esparza and Nielsen 1994]. Hence a semantic query language needs to be careful in terms of the semantic inter-task dependencies that it supports.
Decidability of semantic inter-task relations is a factor in choosing which relations to support in a query language, but it is not the only one. While some relations are decidable, their use may not be very intuitive to stakeholders, in this case the experts who are likely to end up formulating queries over process model collections. It is important that a relation occurs frequently enough in queries to warrant support and that its formal meaning is close to its perceived meaning. Another consideration is that query evaluations are performed in a "reasonable" amount of time as it is anticipated that stakeholders may wish to see the answers to their queries in (almost) real-time.
In this paper a specific process model query language is proposed that is based on a selection of semantic inter-task relations. The Process Query Language (PQL) is a special-purpose programming language for managing process models based on information about process instances that they describe. PQL programs are also called queries. The first version of the PQL language, proposed in this paper, allows formulating search intents for retrieving process models from collections or repositories thereof based on information held in process instances.
The selected semantic inter-task relations supported by PQL, termed predicates in PQL, are shown to be decidable through the application of model checking, an automated technique which, given a finite-state model of a system (e.g., a process model) and a formal property (e.g., a temporal logic formula), systematically checks whether this property holds for (a given state in) that model [Baier and Katoen 2008]. In addition, these predicates are validated with domain experts in terms of their perceived usefulness and intuitiveness. To further facilitate query formulation, PQL is not only provided with an abstract syntax but also a concrete one, which is inspired by SQL. The runtime environment for PQL makes use of indexes to enhance the performance of query evaluation. Indexes are special data structures that speed up the computation of behavioral operators at the cost of the time needed to construct them and the space needed to store them. Performance of query evaluation is demonstrated through a set of experiments with a real-life process model collection.

Abstract Syntax
This section discusses the syntax of the PQL language in the form of an abstract syntax, which is also often referred to as an (abstract) grammar. The grammar of the PQL language is defined using the notation introduced in [Meyer 1990]. In this notation, the abstract grammar of a programming language consists of a finite set of names of constructs and a finite set of productions, each associated with a construct. Each construct describes the structure of a set of objects, also called specimens of the language, using productions of three types; these are aggregate, choice, and list productions.
The top construct of the abstract grammar of the PQL language is Query. It captures the core structure of all PQL programs, i.e., queries.

Query ≜ vars ∶ Variables; atts ∶ Attributes; locs ∶ Locations; pred ∶ Predicate
The Query construct (see above, on the left-hand side) is defined as an aggregate production composed of four components (see above, on the right-hand side); in general, an aggregate production defines a construct that is made of a fixed number of components. The components are separated by semicolons, each preceded by a tag indicating its role within the construct. Thus, every PQL query is composed of variables, attributes, locations, and a predicate, which are distinguished via tags vars, atts, locs, and pred, respectively. Intuitively, a PQL query specifies a search intent to discover the attributes of all process models in the collection of models identified by the locations that satisfy the predicate, where the evaluation of the predicate relies on information stored in the variables. The order in which the various specimens are listed in aggregate productions is irrelevant for the sake of the abstract grammar specification. This order is important in the context of the next section, in which the concrete syntax of the PQL language is proposed.
The Query construct seen above defines a class of PQL queries. One can specify an instance of this class using abstract syntactic expressions. For example, the statement q ≜ Query(vars ∶ vs; atts ∶ as; locs ∶ ls; pred ∶ p) defines a query having vs, as, ls, and p, as variables, attributes, locations, and a predicate, respectively (assuming that all the specimens, i.e., vs, as, ls, and p, are provided).
In PQL, variables, attributes, and locations are defined as list productions, where a list production defines a sequence of zero, one, or more specimens of another construct.
Therefore, a PQL query defines a sequence of zero, one, or more variables, denoted by Variable*; the asterisk symbol stands for the Kleene star, with its standard meaning in formal language theory. Every sequence of attributes must contain at least one attribute, denoted by Attribute+; note that the asterisk symbol is replaced by a plus sign to signify that the list of attributes cannot be empty. Similarly, every sequence of locations must contain at least one location specimen.
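As an informal illustration, the aggregate production for Query and the non-emptiness constraints of its list components can be mirrored by a small data structure. The following is a minimal sketch in Python and is not part of the PQL definition: the class layout, the use of tuples for list productions, the placeholder component values, and the error messages are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Query:
    """Mirrors the aggregate production Query = vars; atts; locs; pred."""
    vars: Tuple[Any, ...]  # zero or more variables
    atts: Tuple[Any, ...]  # at least one attribute
    locs: Tuple[Any, ...]  # at least one location
    pred: Any              # the predicate (left opaque in this sketch)

    def __post_init__(self) -> None:
        # Lists of attributes and locations must not be empty.
        if not self.atts:
            raise ValueError("a query needs at least one attribute")
        if not self.locs:
            raise ValueError("a query needs at least one location")


# Components are identified by their tags, so their order is irrelevant.
q = Query(pred=None, locs=("Universe",), atts=("AttributeID",), vars=())
```

Because every component is addressed by its tag, the call above lists the components in a different order than the production without changing the resulting specimen.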
PQL introduces a dedicated construct, denoted by Variable, to define variables.

Variable ≜ name ∶ VariableName; tasks ∶ SetOfTasks
A PQL variable associates a symbolic name with a set of tasks, or to be more precise, with a collection of abstract concepts that represent tasks. Tasks are introduced in the PQL language to refer to atomic units of observable behavior that are captured in process models. Each variable is an aggregate of two constructs: a variable name (name ∶ VariableName), and a collection of tasks (tasks ∶ SetOfTasks). Such a separation of the variable name from its associated content allows the name to be used independently of the exact information it represents. Thus, a variable name can be bound to a set of tasks during run time, and the content of the set may change during evaluation of the query. When a predicate of some PQL query gets evaluated, every variable name that is mentioned in the predicate is replaced by the corresponding set of tasks.
The PQL language introduces the Attribute construct to allow specifying those process model properties that must be retrieved in a response to a successful query matching exercise.

Attribute ≜ Universe | AttributeID | AttributeName | AttributeModel
A PQL attribute can be anything that identifies a single property or a collection of properties of a process model. In the first version of the PQL language, the Attribute construct is associated with a choice production that allows for four alternatives; in general, a choice production defines a construct as a set of alternatives. The alternatives are separated by vertical bar symbols. Every attribute is either the universe attribute, denoted by Universe, the identifier property, denoted by AttributeID, the name property, denoted by AttributeName, or a formal specification property, denoted by AttributeModel. The universe attribute has a special meaning. It refers to a collection of all properties that are associated with process models in the process model repository.
In the PQL language, locations are used to address process models that are of interest to the search intents of queries, i.e., those models that should be matched against the queries.

Location ≜ Universe | LocationID | LocationDirectory
A location can generally be anything that identifies a single process model or a collection of process models. It is defined as a choice production that allows for three alternatives. A location is either the universe location, denoted by Universe, an identifier location, denoted by LocationID, or a directory location, denoted by LocationDirectory. The universe location is designed to address all process models in the scope of the query (usually, all process models in the repository). Identifier locations are introduced in the PQL language to allow fine-grained targeting of models based on their unique identifiers. We assume that repositories do indeed tag models with unique identifiers, e.g., universally unique identifiers (UUIDs) or integer identifiers. Finally, a directory location allows addressing models that are stored in a particular directory of the repository. For example, a directory location can be a name of a directory, specified as a character string, a URI [URI Planning Interest Group 2001], or an XPath expression [W3C XSL/XML Query Working Groups 2007].
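To make the three kinds of locations concrete, the sketch below resolves a list of locations against a toy repository. It is only an illustration under stated assumptions: the repository layout (integer identifier mapped to a directory string), the `("dir", path)` pair standing in for LocationDirectory, the exact-match treatment of directories, and the function name `resolve` are all hypothetical and not prescribed by PQL.

```python
from typing import Dict, Iterable, Set, Tuple, Union

# Hypothetical repository layout: model identifier -> directory of the model.
Repository = Dict[int, str]

UNIVERSE = "Universe"  # stand-in for the Universe terminal construct

# A location is Universe, an integer identifier (LocationID),
# or a ("dir", path) pair standing in for LocationDirectory.
Location = Union[str, int, Tuple[str, str]]


def resolve(locations: Iterable[Location], repo: Repository) -> Set[int]:
    """Return the identifiers of all models addressed by the locations."""
    selected: Set[int] = set()
    for loc in locations:
        if loc == UNIVERSE:
            # The universe location addresses every model in scope.
            selected |= set(repo)
        elif isinstance(loc, int):
            # An identifier location addresses at most one model.
            if loc in repo:
                selected.add(loc)
        elif isinstance(loc, tuple) and loc[0] == "dir":
            # A directory location addresses all models stored in it.
            selected |= {mid for mid, d in repo.items() if d == loc[1]}
        else:
            raise ValueError(f"unsupported location: {loc!r}")
    return selected


repo = {1: "/sales", 2: "/sales", 3: "/hr"}
addressed = resolve([2, ("dir", "/hr")], repo)  # models 2 and 3
```

The result is a set, so a model addressed by several locations of the same query is matched only once.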
PQL provides several alternatives for specifying a set of tasks. A set of tasks can be defined as an enumeration of tasks, a result of standard operations on sets of tasks, information stored in a variable, a construction macro, or a dynamically-valued constant. These various possibilities are captured in the SetOfTasks construct of the PQL grammar.

SetOfTasks ≜ VariableName | Universe | SetOfTasksLiteral | SetOfTasksConstruction | Union | Intersection | Difference
The SetOfTasks construct is defined as a choice production. One can use the VariableName construct to refer to the set of tasks associated with a name of some variable. Alternatively, one can specify a set of tasks using the Universe construct. The Universe construct, when used in the context of a reference to a set of tasks, constitutes a dynamically-valued constant that refers to the set of all tasks of the process model currently being matched to the query. The content of this set should be created at initialization time, freshly for every new process model that gets matched to the query.
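The alternatives of the SetOfTasks choice production can be read as cases of a recursive evaluator. The sketch below is one possible reading, written under simplifying assumptions: tasks are reduced to plain labels, the set operations are binary, SetOfTasksConstruction is omitted, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

Tasks = FrozenSet[str]  # tasks reduced to their labels for brevity


@dataclass(frozen=True)
class VariableName:
    id: str


@dataclass(frozen=True)
class Universe:
    pass


@dataclass(frozen=True)
class SetOfTasksLiteral:
    tasks: Tasks


@dataclass(frozen=True)
class SetOperation:
    kind: str      # "Union", "Intersection", or "Difference"
    left: object   # a SetOfTasks specimen
    right: object  # a SetOfTasks specimen


def evaluate(expr: object, env: Dict[str, Tasks], model_tasks: Tasks) -> Tasks:
    """Evaluate a SetOfTasks specimen for the model currently being matched."""
    if isinstance(expr, VariableName):
        return env[expr.id]       # the name is replaced by its bound set
    if isinstance(expr, Universe):
        return model_tasks        # recomputed for every matched model
    if isinstance(expr, SetOfTasksLiteral):
        return expr.tasks
    if isinstance(expr, SetOperation):
        left = evaluate(expr.left, env, model_tasks)
        right = evaluate(expr.right, env, model_tasks)
        if expr.kind == "Union":
            return left | right
        if expr.kind == "Intersection":
            return left & right
        if expr.kind == "Difference":
            return left - right
    raise TypeError(f"unsupported SetOfTasks specimen: {expr!r}")


env = {"x": frozenset({"A", "B"})}
model = frozenset({"A", "B", "C"})
rest = evaluate(SetOperation("Difference", Universe(), VariableName("x")), env, model)
```

Passing `model_tasks` as an argument reflects the requirement that the universe set is created afresh for each process model matched to the query.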
The PQL language proposes a notation to specify set of tasks literals, i.e., a notation for representing sets of tasks as fixed values. Set of tasks literals can be defined using the SetOfTasksLiteral construct, which is specified as a list production of zero, one, or more tasks.

SetOfTasksLiteral ≜ Task*
As mentioned above, tasks are abstract representations of atomic units of observable behavior.

Task ≜ label ∶ Label; sim ∶ Similarity
A PQL task is defined as an aggregate of two components: a label, denoted by label ∶ Label, and a similarity degree threshold, denoted by sim ∶ Similarity. The idea is that given a label of a task one may be interested in all the tasks of which the label has (at least) a certain degree of similarity to the given label.
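The role of the similarity threshold can be illustrated with a small matching routine. This is a sketch only: the section does not prescribe a similarity measure, so the character-based ratio from Python's `difflib` is used purely as a stand-in, and the names `Task`, `label_similarity`, and `matching_labels` are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Iterable, Set


@dataclass(frozen=True)
class Task:
    label: str
    sim: float  # similarity degree threshold


def label_similarity(a: str, b: str) -> float:
    # Stand-in measure (an assumption): any function that maps two
    # labels to a value between 0 and 1 could be plugged in here.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matching_labels(task: Task, model_labels: Iterable[str]) -> Set[str]:
    """Labels of a model that are at least task.sim similar to task.label."""
    return {
        label
        for label in model_labels
        if label_similarity(task.label, label) >= task.sim
    }


labels = ["Buy item", "Buy items", "Ship order"]
exact = matching_labels(Task("Buy item", 1.0), labels)
loose = matching_labels(Task("Buy item", 0.9), labels)
```

With a threshold of 1.0 only the identical label is selected, whereas lowering the threshold admits near-identical labels as well.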
Another way to specify a set of tasks is to construct it. For this purpose, one can rely on the SetOfTasksConstruction construct, which is defined as a choice production below.

SetOfTasksConstruction ≜ UnaryPredicateConstruction | BinaryPredicateConstruction
Given a set of tasks and a unary behavioral primitive, the UnaryPredicateConstruction construct can be used to compose a set of tasks that contains every task from the given set for which the given behavioral primitive holds. The given behavioral primitive must be evaluated in the context of the process model that is being matched to the query. Similarly, the BinaryPredicateConstruction construct is introduced in the PQL language to allow selecting those tasks from a given set of tasks for which a certain binary behavioral primitive holds, either with at least one or with all tasks taken from another given set of tasks. The choice of a quantifier type, either existential or universal, to be used during the above described selections is implemented via the AnyAll construct.

UnaryPredicateConstruction ≜ name ∶ UnaryPredicateName; tasks ∶ SetOfTasks
BinaryPredicateConstruction ≜ name ∶ BinaryPredicateName; tasks₁ ∶ SetOfTasks; tasks₂ ∶ SetOfTasks; q ∶ AnyAll
AnyAll ≜ Any | All
Both the UnaryPredicateConstruction construct and the BinaryPredicateConstruction construct are associated with aggregate productions. The AnyAll construct is specified as a choice between the Any qualifier and the All qualifier, where Any and All stand for the existential quantifier type and the universal quantifier type, respectively. The PQL language uses the UnaryPredicateName construct and the BinaryPredicateName construct to refer to unary behavioral primitives and binary behavioral primitives, respectively. The first edition of the PQL language supports two unary and six binary behavioral primitives. These are the CanOccur and AlwaysOccurs unary behavioral primitives, and the CanConflict, CanCooccur, Conflict, Cooccur, TotalCausal, and TotalConcurrent binary behavioral primitives.
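The two construction macros amount to filtering a set of tasks by a behavioral primitive. The sketch below shows this reading under stated assumptions: behavioral primitives are passed in as plain two-valued Python callables (a real implementation would evaluate them against the process model being matched), the toy `can_occur` and `conflict` relations are invented for the example, and the function names are hypothetical.

```python
from typing import Callable, Iterable, Set

UnaryPrimitive = Callable[[str], bool]
BinaryPrimitive = Callable[[str, str], bool]


def unary_construction(holds: UnaryPrimitive, tasks: Iterable[str]) -> Set[str]:
    """Keep every task of the given set for which the primitive holds."""
    return {t for t in tasks if holds(t)}


def binary_construction(
    holds: BinaryPrimitive,
    tasks1: Iterable[str],
    tasks2: Iterable[str],
    q: str,
) -> Set[str]:
    """Keep tasks of tasks1 related to any ("Any") or all ("All") tasks of tasks2."""
    if q not in ("Any", "All"):
        raise ValueError(f"unknown quantifier: {q!r}")
    others = list(tasks2)
    combine = any if q == "Any" else all
    # Note: with "All" and an empty tasks2, every task of tasks1 is kept.
    return {t for t in tasks1 if combine(holds(t, u) for u in others)}


# Toy relations for illustration only (not derived from a real model).
def can_occur(t: str) -> bool:
    return t in {"A", "B"}


def conflict(t: str, u: str) -> bool:
    return {t, u} == {"A", "B"}


occurring = unary_construction(can_occur, ["A", "B", "C"])
conflicting = binary_construction(conflict, ["A", "B", "C"], ["B", "C"], "Any")
```

Here `occurring` contains A and B, and `conflicting` contains only A, since A is the only task of the first set that conflicts with some task of the second set.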
Finally, a set of tasks can be constructed from other sets of tasks via the application of the fundamental set operations of union, intersection, and difference, denoted by the Union, Intersection, and Difference constructs, respectively.
The PQL language proposes several ways to specify predicates; all the options are captured in the choice production that is associated with the Predicate construct. The UnaryPredicate construct and the BinaryPredicate construct are introduced in the PQL language to allow checking the unary behavioral primitives and the binary behavioral primitives, respectively. Both of these constructs are aggregations of a name (specified by the UnaryPredicateName construct or the BinaryPredicateName construct) and a respective number of Task constructs: one for the UnaryPredicate construct and two for the BinaryPredicate construct.
The PQL language utilizes a well-known mechanism of macros for combining results of several UnaryPredicate or BinaryPredicate checks into a result of a single statement.
The aggregate production associated with the UnaryPredicateMacro construct is composed of a reference to a unary behavioral primitive (name ∶ UnaryPredicateName), a set of tasks (tasks ∶ SetOfTasks), and a quantifier (q ∶ AnyAll). Intuitively, a single macro statement p ≜ UnaryPredicateMacro(name ∶ n; tasks ∶ ts; q ∶ x) is equivalent to a complex check of whether it holds that for at least one (if x is set to Any) or for every (if x is set to All) task t in set of tasks ts statement UnaryPredicate(name ∶ p.name; task ∶ t) evaluates to true. Similarly, one can rely on the BinaryPredicateMacro construct to combine results of multiple BinaryPredicate checks.

BinaryPredicateMacroTaskSet ≜ name ∶ BinaryPredicateName; task ∶ Task; tasks ∶ SetOfTasks; q ∶ AnyAll
BinaryPredicateMacroSetSet ≜ name ∶ BinaryPredicateName; tasks₁ ∶ SetOfTasks; tasks₂ ∶ SetOfTasks; q ∶ AnyEachAll
AnyEachAll ≜ Any | Each | All
The BinaryPredicateMacroTaskSet construct is designed to allow checking whether a certain binary behavioral primitive (name ∶ BinaryPredicateName) holds between a given task (task ∶ Task) and either at least one (if the AnyAll construct is instantiated with the Any specimen) or every (if the AnyAll construct is instantiated with the All specimen) task in a given set of tasks (tasks ∶ SetOfTasks). The BinaryPredicateMacroSetSet construct can be used to check whether a binary behavioral primitive of interest evaluates to true for certain pairs of tasks in the Cartesian product of two given sets of tasks. Note the option to use the Each qualifier as a specimen of the AnyEachAll construct in the respective production above. When employed, this option induces a check of whether for every task in one given set of tasks the specified behavioral relation holds with some task from the other given set of tasks.
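The quantifier semantics of the predicate macros can be sketched as follows. This is an illustration under stated assumptions rather than the reference semantics: primitives are modeled as two-valued Python callables, the Each option is read as "every task of the first set is related to some task of the second set" (the text leaves open which set plays which role), and all function names as well as the toy `precedes` relation are hypothetical.

```python
from itertools import product
from typing import Callable, Iterable, List


def _quantify(results: List[bool], q: str) -> bool:
    if q == "Any":
        return any(results)
    if q == "All":
        return all(results)
    raise ValueError(f"unknown quantifier: {q!r}")


def unary_macro(holds: Callable[[str], bool], tasks: Iterable[str], q: str) -> bool:
    """One unary check per task, combined by the quantifier."""
    return _quantify([holds(t) for t in tasks], q)


def binary_macro_task_set(
    holds: Callable[[str, str], bool], task: str, tasks: Iterable[str], q: str
) -> bool:
    """One binary check between the given task and each task of the set."""
    return _quantify([holds(task, t) for t in tasks], q)


def binary_macro_set_set(
    holds: Callable[[str, str], bool],
    tasks1: Iterable[str],
    tasks2: Iterable[str],
    q: str,
) -> bool:
    first, second = list(tasks1), list(tasks2)
    if q == "Each":
        # Assumed reading: every task of the first set is related
        # to at least one task of the second set.
        return all(any(holds(a, b) for b in second) for a in first)
    # Any / All range over the Cartesian product of the two sets.
    return _quantify([holds(a, b) for a, b in product(first, second)], q)


# Toy relation for illustration only: alphabetical order of labels.
def precedes(a: str, b: str) -> bool:
    return a < b


any_pair = binary_macro_set_set(precedes, ["A", "B"], ["B", "C"], "Any")
all_pairs = binary_macro_set_set(precedes, ["A", "B"], ["B", "C"], "All")
each_task = binary_macro_set_set(precedes, ["A", "B"], ["B", "C"], "Each")
```

In this toy example the All check fails because the pair (B, B) is not related, while the Each check succeeds because both A and B precede some task of the second set.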
The PQL language supports checks of basic binary relations between sets of tasks. These are captured by the choice production associated with the SetPredicate construct. The PQL language allows checking if a task is a member of a given set of tasks. This can be accomplished using the TaskInSetOfTasks construct, which is specified as an aggregation of a task (task ∶ Task) and a set of tasks (tasks ∶ SetOfTasks). Moreover, the PQL language allows checking several binary relations between sets of tasks using the SetComparison construct. The SetComparison construct is composed of two sets of tasks (tasks₁ ∶ SetOfTasks and tasks₂ ∶ SetOfTasks) and a reference to a comparison operator (oper ∶ SetComparisonOperator). The PQL language supports five comparison operations. They specify checks of whether two sets of tasks are identical (Identical), different (Different), overlap (OverlapsWith), or whether one set of tasks is a subset (SubsetOf) or a proper subset (ProperSubsetOf) of the other set of tasks.
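The membership test and the five comparison operators map naturally onto ordinary set operations. The sketch below illustrates this mapping; it assumes that Different means "not identical" and that OverlapsWith means "shares at least one task", and the dictionary and function names are hypothetical.

```python
from typing import Callable, Dict, Iterable, Set

# Assumed readings of the five SetComparisonOperator alternatives.
SET_COMPARISON: Dict[str, Callable[[Set[str], Set[str]], bool]] = {
    "Identical": lambda a, b: a == b,
    "Different": lambda a, b: a != b,
    "OverlapsWith": lambda a, b: not a.isdisjoint(b),
    "SubsetOf": lambda a, b: a <= b,
    "ProperSubsetOf": lambda a, b: a < b,
}


def task_in_set_of_tasks(task: str, tasks: Iterable[str]) -> bool:
    """TaskInSetOfTasks: membership of a task in a set of tasks."""
    return task in set(tasks)


def set_comparison(tasks1: Iterable[str], oper: str, tasks2: Iterable[str]) -> bool:
    """SetComparison: apply the named operator to two sets of tasks."""
    return SET_COMPARISON[oper](set(tasks1), set(tasks2))


small, large = {"A", "B"}, {"A", "B", "C"}
is_proper = set_comparison(small, "ProperSubsetOf", large)
overlaps = set_comparison({"C", "D"}, "OverlapsWith", large)
```

Note that a set is a subset of itself but not a proper subset of itself, which is the only case in which SubsetOf and ProperSubsetOf disagree.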
As PQL is designed to utilize three-valued reasoning, it operates with three truth values: true, false, and unknown. This is reflected in the three literals of the choice production associated with the TruthValue construct, which is proposed below.

TruthValue ≜ True | False | Unknown
To allow complex logical statements on atomic propositions, PQL supports standard logical operations. These are negation (Negation), conjunction (Conjunction), and disjunction (Disjunction).
For the three-valued logic used in PQL to be functionally complete, the language includes a test of whether a given three-valued logic value is unknown. This check is reflected in the IsUnknown option of the LogicalTest construct proposed below, which for the sake of completeness allows for a total of six different tests of whether a three-valued logic value is or is not equal to a certain truth value.

LogicalTest ≜ IsTrue | IsNotTrue | IsFalse | IsNotFalse | IsUnknown | IsNotUnknown
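The interplay of the three truth values, the logical operations, and the logical tests can be sketched in a few lines. The truth tables used below are those of Kleene's strong three-valued logic, which is also the logic used by SQL; that PQL adopts exactly these tables is an assumption of this sketch, as are all names in the code.

```python
from enum import Enum


class TruthValue(Enum):
    # Ordered so that FALSE < UNKNOWN < TRUE.
    FALSE = 0
    UNKNOWN = 1
    TRUE = 2


def negation(p: TruthValue) -> TruthValue:
    # Swaps TRUE and FALSE and leaves UNKNOWN unchanged.
    return TruthValue(2 - p.value)


def conjunction(p: TruthValue, *rest: TruthValue) -> TruthValue:
    # The "least true" operand decides the result.
    return TruthValue(min(v.value for v in (p, *rest)))


def disjunction(p: TruthValue, *rest: TruthValue) -> TruthValue:
    # The "most true" operand decides the result.
    return TruthValue(max(v.value for v in (p, *rest)))


def _test(condition: bool) -> TruthValue:
    # Logical tests never yield UNKNOWN.
    return TruthValue.TRUE if condition else TruthValue.FALSE


def is_true(p: TruthValue) -> TruthValue:
    return _test(p is TruthValue.TRUE)


def is_false(p: TruthValue) -> TruthValue:
    return _test(p is TruthValue.FALSE)


def is_unknown(p: TruthValue) -> TruthValue:
    return _test(p is TruthValue.UNKNOWN)


def is_not_true(p: TruthValue) -> TruthValue:
    return negation(is_true(p))


def is_not_false(p: TruthValue) -> TruthValue:
    return negation(is_false(p))


def is_not_unknown(p: TruthValue) -> TruthValue:
    return negation(is_unknown(p))


U, T, F = TruthValue.UNKNOWN, TruthValue.TRUE, TruthValue.FALSE
```

Negation, conjunction, and disjunction alone map an all-unknown input to unknown, so a test such as IsUnknown is needed to obtain a definite answer from an unknown value.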
For a grammar of a language to be complete, all its constructs must be specified in terms of well-defined components, called the terminal constructs. The following constructs are the terminal constructs of the PQL grammar: Any, All, Each, Universe, AttributeID, AttributeName, AttributeModel, Identical, Different, OverlapsWith, SubsetOf, ProperSubsetOf, True, False, Unknown, as well as all the constructs that are parts of the choice productions associated with the UnaryPredicateName and BinaryPredicateName constructs. None of the above-mentioned constructs has an internal structure; thus, they are the atomic constructs of the PQL language.
Several of the proposed PQL constructs can be defined in terms of well-known sets. For instance, in the discussion above, we hint at the fact that LocationID can be specified as an integer. This can be captured rigorously in the production LocationID ≜ value ∶ Z, where Z is the symbol often used in mathematics to denote the set of all integers. Similarly, we specify LocationDirectory, VariableName, and Similarity, as LocationDirectory ≜ value ∶ S, VariableName ≜ id ∶ V, and Similarity ≜ value ∶ [0, 1], respectively; here, S and V are the set of all character strings and the set of all legal variable names, respectively. Note that set V is defined in the next section.
Some of the PQL constructs are still not defined in terms of terminal constructs. The PQL language trivially defines the Negation construct and all the six options associated with the LogicalTest construct in terms of a single Predicate component, e.g., Negation ≜ pred ∶ Predicate, IsTrue ≜ pred ∶ Predicate, etc. Finally, for reasons of space, at this stage we omit rigorous definitions of five PQL constructs: Conjunction, Disjunction, Union, Intersection, and Difference. Intuitively, Conjunction and Disjunction can be defined as sets of predicates, whereas Union, Intersection, and Difference can be specified as collections of sets of tasks. However, any definition of priorities for the operations that the above stated constructs represent in terms of grammar rules is rather lengthy and is driven by semantic, rather than syntactic, rules. In the next section, we discuss priorities of various operations that are supported in PQL, whereas missing rigorous specifications of the five mentioned constructs can be found in Appendix A.

Concrete Syntax
The abstract syntax of PQL is independent of any particular representation. This section proposes a mapping from the abstract syntax of PQL to its specific encoding. This encoding constitutes one possible concrete syntax of the PQL language, i.e., its machine- and human-readable representation.
The first concrete syntax of the PQL language proposed in this section is inspired by SQL, a programming language for managing data stored in a relational database management system (DBMS) [Date and Darwen 1997]. Being inspired by SQL, we intend to keep the core structure of concrete PQL queries as similar as possible to that of SQL queries and to reuse SQL keywords in PQL, given that the contexts are similar. The reason for this is threefold:
○ Despite addressing different domains, i.e., dynamic processes versus static data, both languages serve the same purpose, that of querying for information. Note that SQL was originally proposed to retrieve data stored in quasi-relational DBMS [Chamberlin and Boyce 1974].
○ SQL is a widely used standard that is supported by just about every DBMS on the market. As a result, its syntax is well-recognized by technical specialists and analysts. By closely following the concrete syntax of SQL, PQL becomes readily usable by a wide range of stakeholders.
○ As suggested by several interviewees, it would be beneficial for the syntax of the envisioned query language to resemble that of SQL. For example, one interviewee commented: "From an overall strategic point of view it'll bring a lot of benefits because different parts of the organization will be able to work together by using some kind of a structured query language (SQL)".
Given a construct of an abstract grammar, one can specify its concrete syntax as a function that yields all its specific forms. In this section, PQL is defined as a textual language. Hence, for each PQL construct, its concrete syntax is given as a function that takes a specimen of the respective abstract construct as input and returns a collection of character strings that are accepted as its concrete encodings. We shall denote such a function by the name of the respective construct with subscript c.
For example, the concrete syntax of a specimen of the Query construct is defined as follows.

Query_c(q ∶ Query) ≜ Variables_c(q.vars) 'SELECT' Attributes_c(q.atts) 'FROM' Locations_c(q.locs) ('WHERE' Predicate_c(q.pred))? ';'
We use regular expressions [Aho and Ullman 1992] to define the concrete syntax of PQL specimens. Hence, as per the above definition, a PQL query is a character string that starts with a specification of variables, followed by the SELECT keyword, followed by a specification of attributes, followed by the FROM keyword, followed by a specification of locations, followed by the WHERE keyword, followed by a specification of a predicate, followed by the semicolon mark, i.e., ';'. There can be an arbitrary number of whitespace characters between any two subsequent components of a query string. The order of various components is fixed. Note that the presence of the WHERE clause in a PQL query is optional, i.e., the WHERE keyword and the specification of the predicate can be skipped. The reader might have already noticed that the core structure of a PQL query is similar to that of an SQL query that is signified with the declarative SELECT statement and is used to formulate an intent for retrieving data from one or more database tables or expressions.
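The shape of the top-level encoding can be illustrated by a function that assembles a query string from already encoded components. This is a sketch under several assumptions: the three keywords are taken to read SELECT, FROM, and WHERE as in SQL, list elements are joined by commas, '*' is assumed to encode the universe location, and the variable and predicate encodings are passed in as opaque placeholder strings because their concrete forms are not reproduced here. The function name `query_c` is hypothetical.

```python
from typing import Optional, Sequence


def query_c(
    variables: str,
    attributes: Sequence[str],
    locations: Sequence[str],
    predicate: Optional[str] = None,
) -> str:
    """Assemble one concrete encoding of a query from encoded components."""
    if not attributes or not locations:
        raise ValueError("attributes and locations must not be empty")
    parts = [variables] if variables else []
    parts += ["SELECT", ", ".join(attributes)]
    parts += ["FROM", ", ".join(locations)]
    if predicate is not None:
        # The WHERE clause is optional.
        parts += ["WHERE", predicate]
    # Single spaces are used here; any amount of whitespace is allowed.
    return " ".join(parts) + ";"


without_where = query_c("", ["id", "name"], ["*"])
with_where = query_c("", ["*"], ["*"], predicate="P")  # "P" is a placeholder
```

The function returns a single representative string, whereas the concrete syntax function of the grammar yields the whole collection of accepted encodings, e.g., with different amounts of whitespace.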
Specimens of PQL constructs that are associated with list productions must be encoded as string concatenations of concrete forms of their components and whitespace characters. Often, we inject special symbols between every two subsequent components and/or at the beginning and end of the respective encodings. For example, the concrete syntax of a list of variables is defined as follows.

Variables_c(vs ∶ Variables) ≜ isEmpty(vs) ? '' ∶ Variable_c(vs.FIRST) Variables_c(vs.TAIL)
That is, the encoding of the empty list of variables is the empty string. However, if a list of variables contains at least one element, its encoding is constructed as a concatenation of a concrete form of its first element, denoted by vs.FIRST, and an encoding of the list of all its other elements, denoted by vs.TAIL. The concrete syntax of a PQL variable is defined below.
The concrete syntax of all other specimens of PQL constructs that are associated with list productions is defined similarly to that of the Variables construct seen above. However, all these encodings are expected to include special symbols between every two subsequent components. This is the comma symbol, i.e., ',', for specimens of Attributes, Locations, and SetOfTasksLiteral, and the PQL keywords UNION, INTERSECT, EXCEPT, AND, and OR, for specimens of Union, Intersection, Difference, Conjunction, and Disjunction, respectively. Additionally, every encoding of a specimen of the SetOfTasksLiteral construct must begin with the opening curly bracket, i.e., '{', and end with the closing curly bracket, i.e., '}'. For instance, the character string '{"Buy item","Purchase product"}' is a valid encoding of a specimen of the SetOfTasksLiteral construct that contains two elements; here, we follow the standard notation for specifying fixed sets. In the example above, strings "Buy item" and "Purchase product" are valid encodings of tasks. In general, the concrete encoding of a PQL task is defined as follows.
Labels of PQL tasks are always enclosed in double quotes. A label can be preceded by the tilde symbol, i.e., '∼', or succeeded by an encoding of a similarity degree threshold enclosed in square brackets. The tilde symbol denotes the fact that one is interested in all the tasks of which the label has a degree of similarity to the specified label that is equal to or larger than some preconfigured value.
A degree of similarity must be specified as a decimal representation of a real number greater than or equal to zero and less than or equal to one, e.g., 0.5 or .95.
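The textual rules for tasks and set of tasks literals can be sketched as a small recognizer and encoder. This is an illustration under stated assumptions: the ASCII tilde '~' stands in for the tilde symbol, labels are assumed not to contain double quotes, the tilde and the bracketed threshold are treated as mutually exclusive, and the marker "default" for the preconfigured threshold as well as the function names are hypothetical.

```python
import re
from typing import Iterable, Optional, Tuple, Union

# Optional tilde, quoted label, optional threshold in square brackets.
TASK_RE = re.compile(r'^(~)?"([^"]*)"(?:\[(\d*\.?\d+)\])?$')


def parse_task(text: str) -> Tuple[str, Optional[Union[str, float]]]:
    """Return (label, similarity) for one concrete task encoding.

    similarity is None for a plain label, "default" for the tilde form,
    and a float between 0 and 1 for an explicit threshold.
    """
    match = TASK_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a task encoding: {text!r}")
    tilde, label, sim = match.groups()
    if tilde and sim is not None:
        raise ValueError("use either the tilde or an explicit threshold")
    if sim is not None:
        value = float(sim)
        if not 0.0 <= value <= 1.0:
            raise ValueError("similarity must lie between 0 and 1")
        return label, value
    return label, ("default" if tilde else None)


def encode_set_of_tasks_literal(labels: Iterable[str]) -> str:
    """Encode plain labels as a set of tasks literal in curly brackets."""
    return "{" + ",".join(f'"{label}"' for label in labels) + "}"


plain = parse_task('"Buy item"')
fuzzy = parse_task('~"Buy item"')
explicit = parse_task('"Buy item"[.95]')
literal = encode_set_of_tasks_literal(["Buy item", "Purchase product"])
```

The encoder reproduces the example literal given in the text, and the recognizer accepts thresholds written with or without a leading zero.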
Every specimen of a construct that is associated with a choice production is a specimen of one of the constructs from the list of alternatives of the choice production. Hence, a concrete syntax of an abstract grammar can (and often does) omit special encodings to signify choice productions, which is the case for the concrete syntax of PQL that is being proposed here. Thus, in the sequel, we only propose concrete encodings for the remaining aggregate productions of the PQL grammar.
A specimen of the Attribute construct is a specimen of one out of four terminal constructs. These are Universe, AttributeID, AttributeName, and AttributeModel. In PQL, they are denoted by character strings '*', 'id', 'name', and 'model', respectively. Similarly, a specimen of the Location construct is a specimen of either Universe, or LocationID, or LocationDirectory.
A specimen of the SetComparison construct can be specified in this concrete syntax of PQL as two encodings of sets of tasks with a representation of a comparison operator in between. These encodings rely on the use of the PQL keywords NOT, IS, TRUE, FALSE, UNKNOWN, and the concrete encoding of the Predicate construct, which is defined by concrete encodings of all the alternatives associated with the corresponding choice production.
We envision the introduction of other specific encodings of the abstract syntax of the PQL language, e.g., a visual encoding of PQL queries. We believe that the availability of different concrete encodings will make the PQL language accessible to a wider audience.