Reachability analysis via orthogonal sets of patterns

Rule-based modelling languages, such as Kappa, allow for the description of very detailed mechanistic models. Yet, as the rules become more and more numerous, there is a need for formal methods to enhance the level of confidence in the models that are described with these languages. We develop abstract interpretation tools to capture invariants about the biochemical structure of the bio-molecular species that may occur in a given model. In previous works, we have focused on the relationships between the states of the sites that belong to the same instance of a protein. This comes down to detecting, for a specific set of patterns, which ones may be reachable during the execution of the model. In this paper, we generalise this approach to a broader family of abstract domains that we call orthogonal sets of patterns. More precisely, an orthogonal set of patterns is obtained by recursively refining the information about some patterns containing a given protein, so as to partition the set of occurrences of this protein in any mixture. We show that orthogonal sets of patterns offer a convenient choice to design scalable and accurate static analyses. As an example, we use them to infer properties in models with transport of molecules (more precisely, we show that two proteins that are connected always belong to the same compartment), and in models involving double bindings (we show that whenever a protein of type A is bound twice to proteins of type B, then the protein A is necessarily bound twice to the same instance of the protein B).


Introduction
Mechanistic models of signalling pathways suffer from a large combinatorial complexity that is due to potential bindings between proteins and post-translational transformations of proteins. Rule-based modelling languages, such as Kappa [11] or BNGL [2], allow for the description of very detailed mechanistic models, thanks to context-free rewrite rules. Yet, as the rules become more and more numerous, there is a need for formal methods to enhance the level of confidence in the models that are described with these languages.
This material is based upon works sponsored by the Defense Advanced Research Projects Agency (DARPA) and the U. S. Army Research Office under grant number W911NF-14-1-0367. The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency, or the U. S. Department of Defense.
1 Email: feret@ens.fr
2 Email: quyen@di.ens.fr
Our classical pipeline to check the consistency of models consists in combining static analysis [9,13,15] and causality analysis [10,6]. Static analysis automatically provides, without executing the model, an over-approximation of its potential behaviour. When static analysis warns about an unexpected behaviour, causal analysis can be used to check whether this is a true alarm and, if so, to understand its origin. Then, the modeller can update the model accordingly.
In previous works, we have used abstract interpretation [4,5] to formalise static analyses that can capture the relationships between the potential states of the sites that belong to the same instance of a protein [9,13]. In this paper, we generalise this approach to a broader family of abstract domains that we call orthogonal sets of patterns. Orthogonal sets of patterns can be obtained by recursively refining the information about a given pattern so as to partition the set of occurrences of a given protein in any mixture. We show that orthogonal sets of patterns offer a convenient choice to design scalable and accurate static analyses. As an example, we use them to infer properties in models with transport of molecules (more precisely, we show that two proteins that are connected always belong to the same compartment), and in models involving double bindings (we show that whenever a protein of type A is bound twice to proteins of type B, then the protein A is necessarily bound twice to the same instance of the protein B).
We have integrated our framework within an open source static analyser for Kappa models [3], and tested it against models in development. The computation time for these analyses demonstrates the scalability of our approach.
The paper is organised as follows. In Sect. 2, we provide some case studies to illustrate the goal of our analysis. In Sect. 3, we recall the semantics of Kappa. In Sect. 4, we describe a generic abstraction of bio-molecular mixtures by the means of sets of patterns and we introduce the notion of orthogonal sets of patterns. In Sect. 5, we use orthogonal sets of patterns to abstract the set of reachable mixtures in Kappa models.

Case studies
We introduce three toy models, so as to illustrate the kinds of properties concerning the biochemical structure of molecular species we are interested in.

A model with relationships among the sites of a protein
Firstly, we introduce a model in order to sketch how relationships among the sites of a given protein may emerge from the description of the mechanistic interactions between proteins. In this model, we consider three kinds of protein. Let us call them A, B, and C. Proteins of kind A have a single binding site that we call r. Proteins of kind B have three identified sites: two binding sites that we call l and r, and a phosphorylatable site that we call x. Proteins of kind C have a single binding site that we call l. We describe the instances of these proteins graphically. Each protein of kind A is drawn as a square with its site r on the right. Each protein of kind B is drawn as a rectangle, with its site l on the left, its site r on the bottom right, and its site x on the top right. Each protein of kind C is drawn as an oval shape with its site l on the left. When the site x of a protein B is phosphorylated, we annotate it with a smaller black disk; otherwise, we annotate it with a white disk. Pairs of binding sites that are linked together are depicted by an edge between these sites, whereas free sites are annotated with the symbol ⊣. In a rule, it is also possible to specify that a site is bound without specifying its partner site; in such a case, we annotate the site with a dangling edge.
The model is specified as a set of rewrite rules (see Fig. 1) that describe the potential biochemical interactions between the sites of the proteins. We consider the following interactions. A protein A and a protein B may bind together provided that their respective sites r and l are free (see Fig. 1(a)). When the site l of a protein B is bound, this protein may get phosphorylated (see Fig. 1(c)). Lastly, when the protein B is phosphorylated, this protein may bind with a protein C, provided that the site r of the protein B and the site l of the protein C are both free (see Fig. 1(e)). These interactions are reversible under specific conditions. We assume that a protein A and a protein B may dissociate from each other only if the protein B is not phosphorylated (see Fig. 1(b)). In other words, the phosphorylation of a protein B by a protein A stabilises the binding between these two proteins. Moreover, we assume that a protein B may be dephosphorylated only if it is not bound to a protein C (see Fig. 1(d)). That is to say that the binding between a protein B and a protein C blocks the phosphate ion that is docked to this protein B. Lastly, a binding between a protein B and a protein C may be released unconditionally (see Fig. 1(f)).
In this model, because the phosphorylation of a protein B blocks its potential dissociation from a protein A, any protein B that is phosphorylated is necessarily bound to a protein A. In the same way, because the binding between a protein B and a protein C blocks the potential dephosphorylation of this protein B, any protein B that is bound to a protein C is necessarily phosphorylated (and hence bound to a protein A). These are the properties of interest we would like to compute automatically from the model. We notice that views analysis [13,9] can capture these properties already. Views analysis is indeed a particular case of the framework that we are describing in this paper.
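The two invariants above can be checked by hand on a small abstraction of the model. The following Python sketch is only an illustration and not the analysis of the paper: it assumes that the state of a single protein B is summarised by three booleans (the hypothetical names `l_bound`, `x_phos`, `r_bound`), that partners A and C are always available, and that the initial protein B is free and unphosphorylated.

```python
# Hypothetical abstraction: one protein B is a triple (l_bound, x_phos, r_bound).
# Partners A and C are assumed to be always available (an assumption of this sketch).
def successors(state):
    l_bound, x_phos, r_bound = state
    if not l_bound:
        yield (True, x_phos, r_bound)    # Fig. 1(a): A binds the site l of B
    if l_bound and not x_phos:
        yield (False, x_phos, r_bound)   # Fig. 1(b): release only if unphosphorylated
        yield (l_bound, True, r_bound)   # Fig. 1(c): phosphorylation needs l bound
    if x_phos and not r_bound:
        yield (l_bound, False, r_bound)  # Fig. 1(d): dephosphorylation needs r free
        yield (l_bound, x_phos, True)    # Fig. 1(e): C binds a phosphorylated B
    if r_bound:
        yield (l_bound, x_phos, False)   # Fig. 1(f): unconditional release of C

def reachable(init):
    """Explicit exploration of the (finite) abstract state space."""
    seen, todo = {init}, [init]
    while todo:
        for nxt in successors(todo.pop()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

# Assumed initial state: all sites free, site x unphosphorylated.
STATES = reachable((False, False, False))
PHOS_IMPLIES_BOUND_TO_A = all(l or not x for (l, x, r) in STATES)
BOUND_TO_C_IMPLIES_PHOS = all(x or not r for (l, x, r) in STATES)
```

Under these assumptions, only four of the eight combinations are explored, and both implications hold on each of them.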

A model with transportation
Now we introduce another model to illustrate the kind of properties that may arise when proteins may be transported within a finite set of compartments. In this model, we consider two compartments: the cytoplasm and the nucleus. We also consider two kinds of protein D and E. The protein E is assumed to be a transport molecule. Its purpose is to grab a protein D in the cytoplasm, so as to take it into the nucleus. Each protein has two sites l and b. The site l is used to encode the location of the protein, whereas the site b is a binding site. Each protein D is drawn as a rectangle. Its site l is depicted at the top right of the protein and its site b at the bottom right. Each protein E is drawn as an oval shape. Its site l is depicted at the top left of the protein and its site b at the bottom left. Furthermore, we annotate the site l of a protein with a white disk whenever this protein is in the cytoplasm, and with a black disk whenever this protein is in the nucleus. Now we describe the rewrite rules that define the potential mechanistic interactions between the proteins of our model. These rules are depicted in Fig. 2. A protein E that is not bound to a protein D may freely move from the cytoplasm into the nucleus, and conversely (see Fig. 2(a)). Such a move is encoded as a change of the state of the site l of this protein. A protein D and a protein E may bind to each other when they are both located in the cytoplasm (see Fig. 2(b)). Then, when bound together, a protein E may take the protein D inside the nucleus (see Fig. 2(c)). We notice that, in this rule, the state of the site l of D and the state of the site l of E are replaced by a black disk simultaneously, which encodes the move of the proteins. Two proteins that are bound may dissociate from each other without any further conditions (see Fig. 2(e)). Moreover, we assume that a protein D that is no longer attached to a protein E may return into the cytoplasm (see Fig. 2(d)). We also assume that a dimer formed of a protein D and a protein E may not move from the nucleus to the cytoplasm.
In this model, the main property of interest is that when two proteins are bound together, they are necessarily located in the same compartment. This kind of property cannot be captured by views analysis [13,9] because it concerns the state of two sites that do not belong to the same protein. The framework that we propose in this paper can capture these properties automatically, hence extending the framework of views analysis.

A model with double binding
Our last case study involves two kinds of protein which may form a double binding between each other. Double bindings between proteins are important because they make dimers more stable.
In this model, we consider two kinds of protein F and G. Each protein has two binding sites x and y. Each protein F is drawn as a square with its site x at the top right and its site y at the bottom right. Each protein G is drawn as a square with its site x at the top left and its site y at the bottom left.
The rules for this model are given in Fig. 3. Two proteins F and G may bind their respective sites x provided that either all their sites are free (see Fig. 3(a)), or that they are already connected together via their sites y (see Fig. 3(b)). In the same way, two proteins F and G may bind their sites y provided that either all their sites are free (see Fig. 3(c)), or that they are already connected together via their sites x (see Fig. 3(d)). Moreover, any bond may be released without any condition (see Figs. 3(e) and 3(f)).
We notice that whenever both sites of a protein are bound, then they are necessarily bound to the same instance of a protein. This is the property of interest for this model. Our framework is able to capture it automatically.

Kappa
In this section, we recall the operational semantics of Kappa.

Site-graphs
We start by introducing site-graphs which are used to describe bio-molecular species and patterns.
Firstly we define the signature of a model. Agent types in $\Sigma_{ag}$ denote agents of interest, such as kinds of proteins. A site identifier in $\Sigma_{st}$ represents an identified locus for capability of interactions between agents. Internal state identifiers in $\Sigma_{int}$ are special attributes which encode potential state configurations, such as the phosphorylation state, the ubiquitination state, or the methylation state. Each agent type $A$ is associated with a set $\Sigma^{int}_{ag\text{-}st}(A)$ of sites which may bear an internal state and a set $\Sigma^{lnk}_{ag\text{-}st}(A)$ of sites which may be linked. We assume without any loss of generality that $\Sigma^{lnk}_{ag\text{-}st}(A) \cap \Sigma^{int}_{ag\text{-}st}(A) = \emptyset$, for every $A \in \Sigma_{ag}$, and we write $\Sigma_{ag\text{-}st}(A)$ for the set of sites $\Sigma^{lnk}_{ag\text{-}st}(A) \cup \Sigma^{int}_{ag\text{-}st}(A)$.
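Example 3.2 We give the signatures for our three case studies (see Sect. 2).
(i) In the first case study, the signature is the following one: the agent types are A, B, and C; the agent type A has a single binding site r; the agent type B has two binding sites l and r, and a property site x that encodes its phosphorylation state; the agent type C has a single binding site l.
(ii) In the model with transportation of proteins, the signature is the following one: the agent types are D and E; each of them has a property site l that encodes the compartment (cytoplasm or nucleus) and a binding site b.
(iii) In the model with double bonds, the signature is the following one: the agent types are F and G; each of them has two binding sites x and y, and no property site.
For the rest of the paper, we assume that we are given a signature $\Sigma$. In Kappa, both the state of the system and the patterns which are used to describe transformation rules are defined as site-graphs.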
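The three signatures can be encoded as plain data, which makes the disjointness assumption between binding sites and property sites easy to check. The sketch below is a hypothetical encoding read off from Sect. 2: the dictionary layout and the internal state names (`"u"`, `"p"`, `"cyt"`, `"nuc"`) are assumptions of this illustration, not notations of the paper.

```python
# Hypothetical encoding of a signature: "lnk" maps each agent type to its binding
# sites, "int" to its property sites, "states" lists the internal states.
# The state names are illustrative assumptions.
SIG1 = {
    "lnk": {"A": {"r"}, "B": {"l", "r"}, "C": {"l"}},
    "int": {"A": set(), "B": {"x"}, "C": set()},
    "states": {"u", "p"},        # unphosphorylated / phosphorylated
}
SIG2 = {
    "lnk": {"D": {"b"}, "E": {"b"}},
    "int": {"D": {"l"}, "E": {"l"}},
    "states": {"cyt", "nuc"},    # cytoplasm / nucleus
}
SIG3 = {
    "lnk": {"F": {"x", "y"}, "G": {"x", "y"}},
    "int": {"F": set(), "G": set()},
    "states": set(),
}

def sites(sig, agent):
    """All the sites of an agent type: binding sites and property sites."""
    return sig["lnk"][agent] | sig["int"][agent]

def well_formed(sig):
    """Binding sites and property sites are disjoint for every agent type."""
    return all(sig["lnk"][a].isdisjoint(sig["int"][a]) for a in sig["lnk"])
```

For instance, `sites(SIG1, "B")` collects the three sites of the protein B of the first case study.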
Fig. 4. (a) An agent B with the site l bound and the site r free. (b) An agent B with the site r free and the site l bound to the site r of an agent A.
A site $(n, i) \in S$ such that $i \in \Sigma^{int}_{ag\text{-}st}(type(n))$ is called a property site, whereas a site $(n, i) \in S$ such that $i \in \Sigma^{lnk}_{ag\text{-}st}(type(n))$ is called a binding site. Whenever $L(n, i) = \dashv$, the binding site $(n, i)$ is free. Various levels of information may be given about the sites that are bound. Whenever $L(n, i) = -$, the binding site $(n, i)$ is bound to an unspecified site. Whenever $L(n, i) = (n', i')$ (and hence $L(n', i') = (n, i)$), the sites $(n, i)$ and $(n', i')$ are bound together.
For a site-graph $G$, we write $A_G$ for its set of agents, $type_G$ for its typing function, $S_G$ for its set of sites, $L_G$ for its set of links, and $p_G$ for the valuation of its property sites.

Example 3.4 The site-graph G that is depicted in Fig. 4(a) and the site-graph H that is drawn in Fig. 4(b) are two site-graphs for the signature of the first case study (see Sect. 2.1).
In Example 3.4, it is worth noticing that in the site-graph H, the state of every site in every protein is fully defined. Such a site-graph is called a chemical mixture. This is not the case for the site-graph G, since not only is the state of the site l of the protein B only partially specified, but also the site x is missing. The site-graph G is called a pattern.
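The difference between a pattern and a chemical mixture can be made concrete on a small data structure. The following sketch assumes a hypothetical representation of site-graphs as triples (agents, links, property valuation); the constants `FREE` and `SOME`, the function `is_mixture`, and the state `"u"` chosen for the site x in H are assumptions of this illustration.

```python
FREE = "free"                    # the site is free
SOME = "bound-to-unspecified"    # the site is bound, partner not specified

# Signature of the first case study (hypothetical encoding).
SIG = {
    "lnk": {"A": {"r"}, "B": {"l", "r"}, "C": {"l"}},
    "int": {"A": set(), "B": {"x"}, "C": set()},
}

def is_mixture(agents, links, props, sig=SIG):
    """True when every site of every agent is fully specified."""
    for n, ty in agents.items():
        for i in sig["lnk"][ty]:
            target = links.get((n, i))
            if target is None or target == SOME:
                return False                  # missing or partially specified
            if target != FREE and links.get(target) != (n, i):
                return False                  # links must be symmetric
        for i in sig["int"][ty]:
            if (n, i) not in props:
                return False                  # missing internal state
    return True

# G: an agent B with the site l bound to an unspecified site and the site r free.
G = ({1: "B"}, {(1, "l"): SOME, (1, "r"): FREE}, {})
# H: an agent B with r free and l bound to the site r of an agent A.
# The state "u" of the site x is an assumption of this sketch.
H = (
    {1: "B", 2: "A"},
    {(1, "l"): (2, "r"), (2, "r"): (1, "l"), (1, "r"): FREE},
    {(1, "x"): "u"},
)
```

With this encoding, `is_mixture(*H)` holds whereas `is_mixture(*G)` does not.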

Relations among site-graphs
Site-graphs can be more or less specific. We introduce some material in order to compare site-graphs according to different levels of refinement.
Two site-graphs may be related by structure-preserving functions, which are called homomorphisms.

Definition 3.5 [homomorphisms] A homomorphism
We notice that a homomorphism from a site-graph G to a site-graph H, and a homomorphism from the site-graph H to another site-graph I compose.The result of this composition is a homomorphism from the site-graph G to the site-graph I.
An embedding is a homomorphism that is induced by an injective function. An embedding $f$ from a site-graph $G$ to a site-graph $H$ is usually denoted as $f : G \hookrightarrow H$. We also use the notation $G \hookrightarrow H$ to express the fact that there exists an embedding from the site-graph $G$ to the site-graph $H$, without specifying the embedding.
We notice that, if it is defined, the composition between two embeddings is also an embedding.
It is worth noticing that the notion of embedding between site-graphs is not the same as the notion of embedding between graphs. The major difference is that in a site-graph we have to specify explicitly when a site is free. As a consequence, a site that is free may be embedded only into a site that is free as well. It is sometimes convenient to relax the definition of embedding so as to allow sites that are free to be mapped to sites with arbitrary binding state. We introduce the notion of weak embedding for this purpose.
Definition 3.7 [weak embedding] A weak embedding from a site-graph $G$ to a site-graph $H$ is an embedding from the site-graph $\hat{G}$ to the site-graph $H$, where the site-graph $\hat{G}$ is obtained from the site-graph $G$ by forgetting the binding state of the sites that are free.
We notice that every embedding is also a weak embedding.
Example 3.8 We give in Fig. 5 an example of an embedding from a site-graph to another one, an example of a weak embedding (that is not an embedding), and an example of a homomorphism (that is not an embedding).
Fig. 5. An embedding, a weak embedding (which is not an embedding), and a homomorphism (which is not an embedding).
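The difference between embeddings and weak embeddings can be illustrated by a brute-force search over injective mappings between agents. This is a minimal sketch under stated assumptions: site-graphs are hypothetical triples (agents, links, property valuation), `FREE` and `SOME` are illustrative constants, and the enumeration is exponential, so it is only meant for very small examples.

```python
from itertools import permutations

FREE, SOME = "free", "bound"   # SOME: bound to an unspecified partner

def links_ok(f, links_g, links_h, weak):
    """Check that the mapping f preserves the binding states of G in H."""
    for (n, i), tgt in links_g.items():
        img = links_h.get((f[n], i))
        if tgt == FREE:
            if not weak and img != FREE:
                return False       # a free site must be mapped to a free site
        elif tgt == SOME:
            if img is None or img == FREE:
                return False       # the image must be bound
        elif img != (f[tgt[0]], tgt[1]):
            return False           # the image of the link must be a link
    return True

def embeddings(g, h, weak=False):
    """All the (weak) embeddings from g to h, as dictionaries between agents."""
    (ag, lg, pg), (ah, lh, ph) = g, h
    nodes = list(ag)
    found = []
    for image in permutations(list(ah), len(nodes)):   # injective mappings
        f = dict(zip(nodes, image))
        types_ok = all(ag[n] == ah[f[n]] for n in nodes)
        props_ok = all(ph.get((f[n], i)) == s for (n, i), s in pg.items())
        if types_ok and props_ok and links_ok(f, lg, lh, weak):
            found.append(f)
    return found

# A protein B with its site r free, and a protein B bound to a protein C.
B_FREE = ({1: "B"}, {(1, "r"): FREE}, {})
B_C = ({1: "B", 2: "C"}, {(1, "r"): (2, "l"), (2, "l"): (1, "r")}, {})
```

In this example, `B_FREE` has no embedding into `B_C`, but it has a weak embedding, since the weak variant ignores the constraint on free sites.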

Gluing patterns
Thanks to embeddings, we can glue some patterns together along a shared region.We explain this construction as follows.
In order to glue two patterns $G$ and $H$ together, we need a common region $G_\cap$ that embeds inside both patterns $G$ and $H$, via two embeddings that we call respectively $\psi_G$ and $\psi_H$. The pair $(\psi_G, \psi_H)$ is called a span of embeddings. Intuitively, it identifies pairs of agents which we are going to fuse in the gluing. More precisely, given $n \in A_{G_\cap}$, the agent $\psi_G(n)$ in $G$ and the agent $\psi_H(n)$ in $H$ are going to be merged. Not all spans define a gluing: we have to check whether the contexts of the agents that are fused together are compatible. This is achieved by requiring the existence of a site-graph $G_\cup$ such that both $G$ and $H$ embed into $G_\cup$, via two embeddings that we call respectively $\phi_G$ and $\phi_H$, and that shall satisfy the constraint $\phi_G\,\psi_G = \phi_H\,\psi_H$. In such a case, the pair $(\phi_G, \phi_H)$ is called a cospan of embeddings. The site-graph $G_\cup$ is a good candidate to define the gluing of the site-graphs $G$ and $H$ according to the span $(\psi_G, \psi_H)$; however, we have to check that, on the one hand, any site in $G_\cup$ comes either from the site-graph $G$ or from the site-graph $H$, and, on the other hand, that we have fused as few pairs of sites as was imposed by the span. Roughly speaking, the cospan $(\phi_G, \phi_H)$ should be minimal.
We give the formal definition of a gluing between two patterns, as follows:
Definition 3.9 We call a gluing between two site-graphs $G$ and $H$, any tuple $(G_\cap, \psi_G, \psi_H, \phi_G, \phi_H, G_\cup)$ such that:
(i) $G_\cap$ and $G_\cup$ are two site-graphs;
(ii) $\psi_G$ is an embedding from the site-graph $G_\cap$ to the site-graph $G$;
(iii) $\psi_H$ is an embedding from the site-graph $G_\cap$ to the site-graph $H$;
(iv) $\phi_G$ is an embedding from the site-graph $G$ to the site-graph $G_\cup$;
(v) $\phi_H$ is an embedding from the site-graph $H$ to the site-graph $G_\cup$;
In Def. 3.9, the use of a homomorphism may sound surprising. It is used not only to ensure that any information in $G_\cup$ comes either from $G$ or from $H$, but also to ensure that we only fuse as few pairs of agents as necessary.
Example 3.10 We give an example of a gluing between two site-graphs in Fig. 6. More precisely, we glue together two patterns by fusing their respective proteins F: the first one is made of a protein F and a protein G bound together by their respective sites x, whereas the second one is made of a protein F and a protein G bound together by their respective sites y. As a result, we get a site-graph made of one protein F and two proteins G, the site x of the protein F being bound to the site x of one of the proteins G and the site y of the protein F being bound to the site y of the other protein G.
It could have been possible to also fuse the two instances of the protein G, but it would have led to a more specific gluing, as shown by the existence of a homomorphism from the less specific gluing to the more specific one, making the diagram commute.
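The gluing of Example 3.10 can be reproduced on a simplified representation. The sketch below is an assumption-laden illustration: it only handles a common region made of a single agent, it uses hypothetical triples (agents, links, property valuation) for site-graphs, and it deliberately gives up (by raising an exception) when the fused agent carries two explicit links on the same site, a case in which the partners would have to be fused as well.

```python
FREE, SOME = "free", "bound"   # SOME: bound to an unspecified partner

def merge_link(old, new):
    """Combine two binding states of the same site; None signals a conflict."""
    if old == new:
        return old
    if FREE in (old, new):
        return None                # free versus bound: incompatible contexts
    if old == SOME:
        return new                 # keep the most precise information
    if new == SOME:
        return old
    raise NotImplementedError("the two partners would have to be fused")

def glue_on_agent(g, n_g, h, n_h):
    """Glue g and h by fusing the agent n_g of g with the agent n_h of h."""
    (ag, _, _), (ah, _, _) = g, h
    if ag[n_g] != ah[n_h]:
        return None
    ren_g = {n: ("g", n) for n in ag}   # rename agents apart ...
    ren_h = {n: ("h", n) for n in ah}
    ren_h[n_h] = ren_g[n_g]             # ... except the pair that is fused
    agents, links, props = {}, {}, {}
    for ren, (a, l, p) in ((ren_g, g), (ren_h, h)):
        for n, ty in a.items():
            agents[ren[n]] = ty
        for (n, i), tgt in l.items():
            new = tgt if tgt in (FREE, SOME) else (ren[tgt[0]], tgt[1])
            key = (ren[n], i)
            merged = merge_link(links[key], new) if key in links else new
            if merged is None:
                return None
            links[key] = merged
        for (n, i), s in p.items():
            if props.setdefault((ren[n], i), s) != s:
                return None             # two different internal states
    return agents, links, props

F_X_G = ({1: "F", 2: "G"}, {(1, "x"): (2, "x"), (2, "x"): (1, "x")}, {})
F_Y_G = ({1: "F", 2: "G"}, {(1, "y"): (2, "y"), (2, "y"): (1, "y")}, {})
GLUED = glue_on_agent(F_X_G, 1, F_Y_G, 1)
```

Here `GLUED` contains one protein F and two distinct proteins G, which corresponds to the less specific gluing of the example; fusing the two proteins G as well would require a larger common region.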

Transformation between site-graphs
Rules are symbolic representations of sets of reactions between bio-molecular species. We formalise the notion of rules, and we explain how to refine rules in order to tune their level of specificity.
When a site-graph G is transformed into another site-graph H, it is important to identify which agents of G correspond to which agents of H. Since some agents of G may disappear and some agents of H may be created during the transformation, we need to formalise a partial matching between the agents of the site-graphs G and H. This partial matching is described by the means of a span of embeddings. For the sake of generality, we firstly define the notion of weak partial embeddings.

Definition 3.11 [(weak) partial embedding]
A weak partial embedding from a site-graph $L$ to a site-graph $R$ with domain $D$ is a pair $(h_L, h_R)$ made of a weak embedding $h_L$ from the site-graph $D$ to the site-graph $L$ and a weak embedding $h_R$ from the site-graph $D$ to the site-graph $R$.
A weak partial embedding from a site-graph $L$ to a site-graph $R$ with domain $D$ is usually denoted as $\phi : L \leftarrow D \rightarrow R$. In a weak partial embedding $\phi : L \leftarrow D \rightarrow R$, the site-graph $L$ (resp. $R$) is called the left hand side of (resp. the right hand side of) $\phi$ and is written $lhs(\phi)$ (resp. $rhs(\phi)$). The domain $D$ denotes a region that is shared between the site-graphs $L$ and $R$. The choice of the domain can be made modulo isomorphism. That is to say that a weak partial embedding $\phi = (h_L, h_R)$ and a weak partial embedding $(h_L\,h, h_R\,h)$, where $h$ is an isomorphism from a site-graph to the domain of $\phi$, are considered to be equivalent.
A weak partial embedding is called a partial embedding when each of its two weak embeddings is an embedding.
Weak partial embeddings may be composed thanks to the pullback construction.
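Definition 3.12 [composition] Let $\phi$ and $\phi'$ be two weak partial embeddings such that $rhs(\phi) = lhs(\phi')$. We write $\phi = (h_L, h_R)$, with domain $D_1$, and $\phi' = (h'_L, h'_R)$, with domain $D_2$. There necessarily exist a site-graph $D_3$ and a partial embedding $(g_1, g_2)$ from the site-graph $D_1$ to the site-graph $D_2$, with domain $D_3$, such that: (i) $h_R\,g_1 = h'_L\,g_2$; (ii) and for any other site-graph $D_4$ and any partial embedding $\phi'' = (g'_1, g'_2)$ from the site-graph $D_1$ to the site-graph $D_2$, with domain $D_4$, such that $h_R\,g'_1 = h'_L\,g'_2$, there exists a unique embedding $h$ from the site-graph $D_4$ to the site-graph $D_3$ such that $g'_1 = g_1\,h$ and $g'_2 = g_2\,h$.
With these notations, the weak partial embedding $(h_L\,g_1, h'_R\,g_2)$ is called the composition of the weak partial embeddings $\phi$ and $\phi'$ and is written as $\phi'\,\phi$.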
The composition of two weak partial embeddings is uniquely defined modulo the fact that the domain may be replaced by any isomorphic one.
A weak embedding $h$ from a site-graph $L$ to a site-graph $R$ may be seen as the weak partial embedding $(i_L, h)$. Thus, we can compose a weak partial embedding and a weak embedding (provided that the right hand side of the weak partial embedding is equal to the domain of the weak embedding). We can also compose a weak embedding and a weak partial embedding (provided that the codomain of the embedding is equal to the left hand side of the weak partial embedding). We notice that the composition of two partial embeddings is also a partial embedding.
A rule is a transformation between two site-graphs, a left hand side $L$ and a right hand side $R$. In a rule, some agents and some sites are preserved. This is specified by a site-graph $D$ which is embedded both into $L$ and into $R$ and which describes everything that is preserved. Not all transformations are allowed: one can remove and add agents, create links between free sites, and free pairs of sites that are connected. The agents that are created have to fully define the state of their sites. Our requirements are formalised in the following definition: we have: $(h_R(m), i) \in S_R$ and for every site identifier $i \in \Sigma^{lnk}_{ag\text{-}st}(type_R(m))$, we have:
The constraint (i) ensures that $D$ is a local greatest upper bound. The constraint (ii) ensures that when a site gets bound, we know to which site it is bound. The constraint (iii) ensures that the sites which occur both in the left hand side and in the right hand side of a rule have a binding state in the left hand side of this rule if and only if they have a binding state in the right hand side of this rule. The constraint (iv) ensures that when an agent is created, the state of all its sites is defined.
A rule $L \hookleftarrow D \hookrightarrow R$ is usually denoted as $L \Rightarrow R$ (leaving the two embeddings and the common region implicit). Moreover, we usually denote the left hand side $L$ (resp. the right hand side $R$) of a rule $r : L \Rightarrow R$ as $lhs(r)$ (resp. as $rhs(r)$).
Example 3.14 Examples of rules can be found in Figs. 1, 2, and 3.
Rules may be more or less refined [7,17], by adding more or less information about the context in which they may be applied.

Definition 3.15 [refinements]
A refinement $(r, r', h_L, h_R)$ is a tuple where $r$ is a rule between two site-graphs $L$ and $R$, $r'$ is a rule between two site-graphs $L'$ and $R'$, $h_L$ is an embedding from the site-graph $L$ to the site-graph $L'$, and $h_R$ is an embedding from the site-graph $R$ to the site-graph $R'$ such that: (i) $h_R\,r = r'\,h_L$; (ii) and for every rule $r''$ from the site-graph $L'$ to a site-graph $R''$, and every embedding $h''_R$ from the site-graph $R$ to the site-graph $R''$, such that the condition $h''_R\,r = r''\,h_L$ is satisfied, there exists a unique weak embedding $h$ from the site-graph $R'$ to the site-graph $R''$ such that both following conditions are satisfied: $h''_R = h\,h_R$ and $r'' = h\,r'$.
The use of a homomorphism in Def. 3.15 is required in case of side effects (that is to say if, in rules, some agents are degraded without specifying the state of all their binding sites, or if some instances of the symbol $-$ are removed) (e.g. see [12]).
With the notations of Def. 3.15, we say that there is a transition from the state $L'$ to the state $R'$ via a computation step with the label $(r, h_L)$, and we write $L' \xrightarrow{(r, h_L)} R'$. Moreover, with the same notations, whenever the site-graph $L'$ is a chemical mixture, the site-graph $R'$ is a chemical mixture as well [6].
Example 3.16 We give an example of a rule refinement in Fig. 7.
Refinements can also be used to specialise a rule, for the production of a given pattern [8,12]. Given a rule $r : L \Rightarrow R$ and an embedding $f$ from the right hand side $R$ of the rule $r$ to a site-graph $R'$, we denote by $right\_ref(L \Rightarrow R, f)$ the set of the refinements $(r, r', h_L, h_R)$ such that both $r = L \Rightarrow R$ and $h_R = f$. If the rule $L \Rightarrow R$ induces no side-effects, then the set $right\_ref(L \Rightarrow R, f)$ is a singleton. Otherwise, it may contain several elements, according to which sites of the site-graph $R'$ might have been released by side-effects (e.g. see [12], for an operational definition of $right\_ref(L \Rightarrow R, f)$).

Orthogonal sets of patterns
Now that we have explained the operational semantics of Kappa, we introduce orthogonal sets of patterns as a means to abstract sets of chemical mixtures.
We denote as $\mathcal{M}$ the set of all the chemical mixtures that can be written with the signature $\Sigma$. We assume that we are given a set $\mathcal{P}$ of patterns of interest. The goal of our analysis is to detect which patterns of interest may occur in a mixture that is reachable from a given set of initial mixtures, after having applied zero, one, or several transition steps. The main idea is to abstract a set of chemical mixtures $X \subseteq \mathcal{M}$ by the subset $Y$ of the patterns in $\mathcal{P}$ that may occur in at least one of the chemical mixtures in the set $X$. Thus, we define an abstraction function $\alpha_{\mathcal{P}}$ from the set $\wp(\mathcal{M})$ to the set $\wp(\mathcal{P})$, mapping each set $X \subseteq \mathcal{M}$ to the set $\{P \in \mathcal{P} \mid \exists M \in X,\ P \hookrightarrow M\}$. Intuitively, a subset $Y \subseteq \mathcal{P}$ denotes the set $X \subseteq \mathcal{M}$ of the mixtures such that any pattern in $\mathcal{P}$ that occurs in a mixture $M \in X$ belongs to the set $Y$. This relation may be formalised by the means of a concretisation function: we define the concretisation function $\gamma_{\mathcal{P}}$ from the set $\wp(\mathcal{P})$ to the set $\wp(\mathcal{M})$, that maps any subset $Y \subseteq \mathcal{P}$ to the set $\{M \in \mathcal{M} \mid \forall P \in \mathcal{P},\ P \hookrightarrow M \Rightarrow P \in Y\}$.
The functions $\alpha_{\mathcal{P}}$ and $\gamma_{\mathcal{P}}$ satisfy the following property: for any set $X \subseteq \mathcal{M}$ and any set $Y \subseteq \mathcal{P}$, $\alpha_{\mathcal{P}}(X) \subseteq Y$ if and only if $X \subseteq \gamma_{\mathcal{P}}(Y)$. Thus, the pair of functions $(\alpha_{\mathcal{P}}, \gamma_{\mathcal{P}})$ forms a Galois connexion [4,5] between the complete lattice $\wp(\mathcal{M})$ and the complete lattice $\wp(\mathcal{P})$.
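The Galois connexion property can be checked exhaustively on a toy instance. In the following sketch, mixtures and patterns are mere labels and the relation stating which pattern occurs in which mixture is given as a table; the names `MIXTURES` and `OCCURS`, as well as their contents, are hypothetical and only serve the illustration.

```python
from itertools import chain, combinations

# Hypothetical toy instance: OCCURS maps each pattern to the mixtures it embeds into.
MIXTURES = {"m1", "m2", "m3"}
OCCURS = {"p": {"m1", "m2"}, "q": {"m2"}, "r": {"m3"}}

def alpha(xs):
    """Patterns that occur in at least one mixture of xs."""
    return {p for p, ms in OCCURS.items() if ms & xs}

def gamma(ys):
    """Mixtures in which every occurring pattern of interest belongs to ys."""
    return {
        m for m in MIXTURES
        if all(p in ys for p, ms in OCCURS.items() if m in ms)
    }

def subsets(s):
    """All the subsets of a finite collection."""
    s = list(s)
    return [set(c) for c in chain.from_iterable(
        combinations(s, k) for k in range(len(s) + 1))]

def is_galois_connection():
    """alpha(X) is included in Y if and only if X is included in gamma(Y)."""
    return all(
        (alpha(x) <= y) == (x <= gamma(y))
        for x in subsets(MIXTURES)
        for y in subsets(OCCURS)
    )
```

The check succeeds for any choice of the table, since it only relies on the definitions of the two functions.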
In order to apply a rule in a specific context, we have to check whether it is possible to reach a mixture that contains an occurrence of the left hand side of the refinement of this rule that specialises this rule to this particular context. This comes down to checking whether a given pattern occurs in a reachable mixture. Thus, we propose to use our abstraction to answer this question approximately. This is the purpose of the following definition.
It is worth noticing that the choice of the set $\mathcal{P}$ of the patterns of interest is crucial. It impacts the accuracy of our abstraction, that is to say our capability to express the fact that a given pattern is reachable, or not. But it also impacts the efficiency of our analysis: according to the set of patterns of interest, it may be more or less costly to compute whether the relation $Y \models_{\mathcal{P}} P$ is satisfied for a given subset $Y$ of patterns of interest and for a given pattern $P$. We address this issue by restricting the set of potential sets of patterns of interest, in order to provide them with a more convenient algebraic structure. We also approximate even further the computation of the relation $\models_{\mathcal{P}}$.
We propose to use orthogonal sets of patterns. Orthogonal sets of patterns have been introduced in [14]. They allow us to partition sets of embeddings from a given pattern to arbitrary site-graphs, according to some contextual information. We give the abstract definition of an orthogonal set of patterns as follows. As an abuse of notation, for each kind of protein $A \in \Sigma_{ag}$, we denote by $A$ the site-graph that has only one agent of type $A$, and no site.

Definition 4.2 [orthogonal set of patterns] Let $A$ be an agent type in $\Sigma_{ag}$. An orthogonal set of patterns for the agent type $A$ is a set $O$ of pairs $(P, f)$ where $P$ is a site-graph and $f$ is an embedding from the site-graph $A$ to the site-graph $P$, such that for any embedding $f'$ from the site-graph $A$ to a mixture $M \in \mathcal{M}$, there exists a unique pair $(P, f) \in O$ and a unique embedding $f'' : P \hookrightarrow M$, satisfying $f' = f''\,f$.
Example 4.3 We now give examples of orthogonal sets of patterns. In these examples, we keep the embeddings implicit, since each pattern contains only one instance of the protein being refined. In Fig. 8, we give an example of an orthogonal set of patterns for the protein B of the first case study (see Fig. 1). In Fig. 9, we give an example of an orthogonal set of patterns for the protein D of the second case study (see Fig. 2). In Figs. 10 and 11, we give examples of orthogonal sets of patterns for the proteins F and G of the third case study (see Fig. 3). There is no need to provide a set of patterns for the protein E in the second case study since the fact that the proteins D and E are located in the same compartment is equivalent to the fact that the proteins E and D are located in the same compartment. This is not the case in the third case study, since, indeed, the fact that each instance of the protein F that is bound twice is bound to the same instance of the protein G does not imply that each instance of the protein G that is bound twice is bound to the same instance of the protein F.
Finite orthogonal sets of patterns for a given agent $A$ can be defined inductively. We start with the set that contains only the pair made of the site-graph $A$ and the identity embedding over $A$.

Feret, Ly
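The partition property of Def. 4.2 and the inductive construction of orthogonal sets can be illustrated in a restricted setting. The sketch below assumes that each pattern contains a single agent and only constrains the states of its own sites; a pattern is then a partial assignment of sites, and a mixture context is a total one. The site domains in `DOMAINS` (a hypothetical encoding of the protein B of the first case study) and the function names are assumptions of this illustration.

```python
from itertools import product

# Assumed encoding of the protein B: each site ranges over two abstract states.
DOMAINS = {"l": ("free", "bound"), "x": ("u", "p"), "r": ("free", "bound")}

def matches(pattern, state):
    """The partial assignment `pattern` occurs in the total assignment `state`."""
    return all(state[s] == v for s, v in pattern.items())

def is_orthogonal(patterns):
    """Every total state of the agent matches exactly one pattern."""
    states = (dict(zip(DOMAINS, vs)) for vs in product(*DOMAINS.values()))
    return all(sum(matches(p, st) for p in patterns) == 1 for st in states)

def refine(patterns, pattern, site):
    """Replace `pattern` by one refinement per possible state of `site`."""
    assert site not in pattern
    rest = [p for p in patterns if p != pattern]
    return rest + [{**pattern, site: v} for v in DOMAINS[site]]

STEP0 = [{}]                                 # the agent alone, with no site
STEP1 = refine(STEP0, {}, "l")               # split according to the site l
STEP2 = refine(STEP1, {"l": "bound"}, "x")   # refine only the bound case
```

Each refinement step preserves orthogonality in this setting, whereas an arbitrary set such as `[{"l": "free"}, {"x": "u"}]` does not partition the contexts.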
Now we define our relaxed abstract satisfaction procedure in the case where the abstract domain is an orthogonal set of patterns.Definition 4.5 [over-approximated abstract satisfaction] Let A be an agent type in Σ ag and O be an orthogonal set of patterns for the agent type A. Let Y be a subset of O. Let P be a pattern.
We write Y |= O P , if and only if, for any agent n ∈ A P , such that type P (n) = A, there exists a pair (P , f ) ∈ Y and a gluing (G ∩ , ψ, ψ , φ, φ , G ∪ ) between the sitegraph P and the site-graph P such that the following conditions are satisfied: (i) G ∩ = A; (ii) ψ maps the unique agent of the site-graph A to the agent n in the pattern P ; (iii) ψ = f .
Let us give some intuition about the procedure in Def.4.5.Since an orthogonal set of patterns for a protein A defines a partition of the potential context for this protein, the idea is to check for each instance of the protein A in the pattern P , that its context is compatible with at least one authorised pattern P ∈ Y .The compatibility is ensured by the existence of a gluing between the two patterns P and P .
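The intuition behind the relaxed satisfaction procedure can be sketched in the restricted setting where every authorised pattern contains a single agent. In that case, the existence of a gluing along this agent boils down to the absence of conflicting site states. The encoding below (patterns as partial assignments of sites, the set `Y`, the function names) is hypothetical and does not cover patterns that mention several agents.

```python
def compatible(context, pattern):
    """No site is given two different states, so that a gluing exists."""
    return all(pattern[s] == v for s, v in context.items() if s in pattern)

def abstract_sat(authorised, agents, agent_type="B"):
    """Each agent of the given type is compatible with an authorised pattern.

    `agents` lists the agents of the pattern to test as (type, context) pairs.
    """
    return all(
        any(compatible(ctx, pat) for pat in authorised)
        for ty, ctx in agents
        if ty == agent_type
    )

# Assumed abstract element for the first case study: a protein B is either
# unphosphorylated, or phosphorylated with its site l bound.
Y = [{"x": "u"}, {"x": "p", "l": "bound"}]
```

With this abstract element, a phosphorylated protein B with a free site l is rejected, which matches the invariant discussed for the first case study, while a pattern that only requires the protein B to be phosphorylated is accepted.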
The relation $\models^\sharp_O$ is a conservative approximation of the relation $\models_{\mathcal{P}}$ of Def. 4.1, as formalised in Prop. 4.6. Given an orthogonal set of patterns $O$ and a subset $Y$ of $O$, we denote as $\hat{Y}$ the set of patterns $\{P \mid \exists f, (P, f) \in Y\}$.

Proposition 4.6 Let $A$ be an agent type in $\Sigma_{ag}$ and $O$ be an orthogonal set of patterns for the agent type $A$. Let $Y$ be a subset of $O$. Let $P$ be a pattern.
The following implication is satisfied: $\hat{Y} \models_{\hat{O}} P \implies Y \models^\sharp_O P$.

Reachability analysis
In this section, we use orthogonal sets of patterns to abstract the set of reachable mixtures in a given Kappa model.
We assume that we are given a set $X_0 \subseteq \mathcal{M}$ of initial mixtures and a set of rules $R$. We are interested in the set of mixtures that can be obtained starting from an initial one, after zero, one, or several transition steps using the rules in the set $R$. The set of reachable mixtures can also be defined as the least fix-point of the monotonic operator $F$ over sets of mixtures, that is defined as follows:

The operator $F$ is a monotonic map over a complete lattice. Thus, by [18], it admits a least fix-point. The least fix-point $\mathrm{lfp}\, F$ of $F$ is indeed the set of reachable mixtures. Yet, the computation of this fix-point can be costly, and may even fail to terminate in the case of polymerisation. Thus we are going to abstract this computation by the means of a collection of orthogonal sets of patterns.

Formally, we assume that we are given a set $\mathcal{D}$ of orthogonal sets of patterns for the agent type $A$. We denote as $\mathcal{P}$ the set of all the patterns that occur in these sets, that is to say that $\mathcal{P} \triangleq \bigcup \{\hat{O} \mid O \in \mathcal{D}\}$. The set $\mathcal{P}$ contains the patterns of interest. We are going to abstract sets of mixtures by subsets of the set $\mathcal{P}$. The set $\wp(\mathcal{M})$ and the set $\wp(\mathcal{P})$ are related by the Galois connexion $(\alpha_{\mathcal{P}}, \gamma_{\mathcal{P}})$ that we have introduced in Sect. 4.
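The concrete fix-point computation described at the start of this section can be sketched by a Kleene iteration. The code below is only an illustration: mixtures are modelled as opaque hashable values, `step` is a hypothetical function returning the one-step successors of a mixture, and the operator is assumed to have the usual form "initial mixtures, plus the argument, plus its one-step successors", which may differ in detail from the definition used in the paper.

```python
def lfp(f, bottom=frozenset()):
    """Kleene iteration: apply f from bottom until the result is stable."""
    current = bottom
    while True:
        following = f(current)
        if following == current:
            return current
        current = following


def make_operator(initial, step):
    """Build X -> X0 | X | post(X) (assumed shape of the concrete operator)."""
    frozen_initial = frozenset(initial)

    def operator(mixtures):
        successors = frozenset(m2 for m in mixtures for m2 in step(m))
        return frozen_initial | mixtures | successors

    return operator


def bounded_step(m):
    """Toy transition relation: integers stand for mixtures, capped at 3."""
    return {m + 1} if m < 3 else set()


REACHABLE = lfp(make_operator({0}, bounded_step))
```

Here `REACHABLE` is the set `{0, 1, 2, 3}`. If the cap in `bounded_step` were removed, the iteration would never stabilise, which mirrors the non-termination issue raised by polymerisation.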
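The computation of the least fix-point of the function $F$ can be done in the abstract, by the means of the Galois connexion $(\alpha_{\mathcal{P}}, \gamma_{\mathcal{P}})$. By [5], the function $\alpha_{\mathcal{P}} \circ F \circ \gamma_{\mathcal{P}}$ is the best abstract counterpart to the function $F$. In particular, the function $\alpha_{\mathcal{P}} \circ F \circ \gamma_{\mathcal{P}}$ is monotonic and its least fix-point satisfies the inclusion: $\mathrm{lfp}\, F \subseteq \gamma_{\mathcal{P}}(\mathrm{lfp}\, [\alpha_{\mathcal{P}} \circ F \circ \gamma_{\mathcal{P}}])$.

Now we derive an explicit definition of the function $\alpha_{\mathcal{P}} \circ F \circ \gamma_{\mathcal{P}}$. When applied to a subset $Y \subseteq \mathcal{P}$ of patterns, the function $\alpha_{\mathcal{P}} \circ F \circ \gamma_{\mathcal{P}}$ computes the set of patterns in $\mathcal{P}$ that can occur in a mixture $M'$ that is reachable in one step of computation from a mixture $M \in \gamma_{\mathcal{P}}(Y)$, and adds them to the set $Y$. For each pattern $P \in [\alpha_{\mathcal{P}} \circ F \circ \gamma_{\mathcal{P}}](Y) \setminus Y$, there necessarily exists a refinement $r'$ of a given rule $r$ such that the right hand side of the rule $r'$ is a gluing between the pattern $P$ and the right hand side of the original rule $r$. It follows that Prop. 5.1 holds.

The computation of the least fix-point of the function $\alpha_{\mathcal{P}} \circ F \circ \gamma_{\mathcal{P}}$ can be costly because of the cost of deciding whether the relation $\models_{\mathcal{P}}$ is satisfied or not. We propose to replace this decision procedure by a more abstract one, using the fact that we use a collection of orthogonal sets of patterns. We define the function $F^\sharp$ as follows:

The function $F^\sharp$ is monotonic. Thus, by [18], it has a least fix-point. Moreover, the function $F^\sharp$ satisfies the inclusion $[\alpha_{\mathcal{P}} \circ F \circ \gamma_{\mathcal{P}}](Y) \subseteq F^\sharp(Y)$, for any subset $Y \subseteq \mathcal{P}$. By [4,5], it follows that $\mathrm{lfp}\, [\alpha_{\mathcal{P}} \circ F \circ \gamma_{\mathcal{P}}] \subseteq \mathrm{lfp}\, F^\sharp$, which ensures that $\mathrm{lfp}\, F \subseteq \gamma_{\mathcal{P}}(\mathrm{lfp}\, F^\sharp)$.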
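The soundness argument can be spelled out as a short chain of inclusions. The derivation below is a sketch that writes $F^\sharp$ for the relaxed abstract function and only uses the two facts stated above (the best abstraction is sound, and it is pointwise below the relaxed function), together with the monotonicity of the concretisation $\gamma_{\mathcal{P}}$.

```latex
\begin{align*}
\mathrm{lfp}\, F
  &\subseteq \gamma_{\mathcal{P}}\bigl(\mathrm{lfp}\, [\alpha_{\mathcal{P}} \circ F \circ \gamma_{\mathcal{P}}]\bigr)
  && \text{(soundness of the best abstraction)} \\
  &\subseteq \gamma_{\mathcal{P}}\bigl(\mathrm{lfp}\, F^\sharp\bigr)
  && \text{(since } \mathrm{lfp}\, [\alpha_{\mathcal{P}} \circ F \circ \gamma_{\mathcal{P}}] \subseteq \mathrm{lfp}\, F^\sharp
     \text{ and } \gamma_{\mathcal{P}} \text{ is monotonic)}
\end{align*}
```

Read contrapositively, a pattern of interest that is absent from $\mathrm{lfp}\, F^\sharp$ cannot occur in any reachable mixture.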
Example 5.2 We give in Fig. 13 the result of the fix-point iteration of the function $F^\sharp$ on the model of Fig. 1, starting with an arbitrary amount of the proteins A, B, and C, with all their binding sites free, and none of their property sites phosphorylated. We have used as an abstract domain the orthogonal set of patterns that is given in Fig. 8. Our analysis has succeeded in proving that whenever the protein B is phosphorylated then its site l is bound, and that whenever its site r is bound, then it is phosphorylated. Even if our analysis is approximate, this is a definite answer: if a pattern is found by the analysis, it may not be reachable; but if a pattern does not occur in the result of the analysis, then it cannot occur in any reachable mixture.

We give in Fig. 14 the result of the fix-point iteration of the function $F^\sharp$ on the model of Fig. 2, starting with an arbitrary amount of the proteins D and E, with all their binding sites free, and all located in the cytoplasm. We have used as an abstract domain the orthogonal set of patterns that is given in Fig. 9. Our analysis has succeeded in proving that whenever two proteins D and E form a dimer, then their site l takes the same value, that is to say that they are located in the same compartment.

We give in Figs. 15 and 16 the result of the fix-point iteration of the function $F^\sharp$ on the model of Fig. 3, starting with an arbitrary amount of proteins F and G, with all their binding sites free. We have used as an abstract domain the orthogonal sets of patterns that are given in Figs. 10 and 11. Our analysis has succeeded in proving that whenever a protein has its two binding sites occupied, then it is necessarily bound twice to the same instance of a protein.
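To make the first case study more tangible, the following sketch runs an abstract fix-point iteration over the fully specified views of a single protein, in the spirit of a local-view analysis. The rules of Fig. 1 are not reproduced in this text, so the six rules below are purely hypothetical and were chosen only so that the two invariants reported above hold; the site names `l`, `p`, `r` and their states are assumptions as well.

```python
# Each hypothetical rule is (precondition on the sites, update of the sites).
RULES = [
    ({"l": "free"}, {"l": "bound"}),             # bind site l
    ({"l": "bound", "p": "u"}, {"l": "free"}),   # unbind l, only if unphosphorylated
    ({"l": "bound", "p": "u"}, {"p": "p"}),      # phosphorylate, only if l is bound
    ({"p": "p", "r": "free"}, {"p": "u"}),       # dephosphorylate, only if r is free
    ({"p": "p", "r": "free"}, {"r": "bound"}),   # bind site r, only if phosphorylated
    ({"r": "bound"}, {"r": "free"}),             # unbind site r
]

# Initial view: all binding sites free, not phosphorylated (sorted by site name).
INITIAL = (("l", "free"), ("p", "u"), ("r", "free"))


def abstract_step(views):
    """Add every view obtained by applying one rule to one known view."""
    result = set(views)
    for view in views:
        state = dict(view)
        for precondition, update in RULES:
            if all(state[site] == value for site, value in precondition.items()):
                result.add(tuple(sorted({**state, **update}.items())))
    return frozenset(result)


def abstract_lfp(initial):
    """Iterate abstract_step from the initial view until stabilisation."""
    views = frozenset([initial])
    while True:
        following = abstract_step(views)
        if following == views:
            return views
        views = following


VIEWS = abstract_lfp(INITIAL)
```

Under these assumed rules the iteration stabilises on four views, and none of them has a phosphorylated protein with a free site l, or a bound site r without phosphorylation. Views that are absent from `VIEWS` are exactly the ones the analysis rules out.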
Our framework is integrated within an open source static analyser for Kappa models [3]. A pre-analysis is used to select the set of orthogonal sets of patterns of interest. Basically, we focus on three kinds of properties:
(i) we are interested in the relationships among the states of each set of sites that occur together in a given instance of a protein in a given rule;
(ii) we are interested in the relationships between the states of each pair of sites that occur in two instances of proteins bound together in the right hand side of a given rule;
(iii) we are interested, for each pair of proteins that can be bound together in several ways, in whether one instance of one of these proteins may be bound to two instances of the other one simultaneously.

We have tested our analyser on three models of various sizes. The first model describes the early events in the integration of the epidermal growth factor. It is inspired by the model that is described in [1]. The second and the third models are in development. The second model is a model of the extra-cellular matrix of the TGFβ by Nathalie Théret and Jean Cocquet, and the third one is a model of the Wnt signalling system by Héctor Francisco Medina Abarca. In the first model, the analysis has detected all the relational information about the states of sites in each protein of the model. It has also proved that a given instance of a receptor cannot be bound to two different instances of a receptor simultaneously. For the other models, the result of the analysis is used by the modellers to check the consistency of their model after each update, as well as to identify what is missing in the current version of their models.
In this paper, we have developed an abstract interpretation-based static analysis to capture invariants about the biochemical structure of bio-molecular species that may be reachable in a given model. Our framework extends our previous works on local views [9,13]. Our analysis can infer not only relationships among the states of the sites of a single protein, but also relationships between the sites of several proteins of a given molecular species.

We have provided a generic analysis that, given a set of patterns of interest, detects which ones cannot arise when executing the model. So as to ensure the scalability of our approach, we have specialised our framework to the case of orthogonal sets of patterns, which can be used to partition the set of the occurrences of a given agent according to some contextual information. Orthogonal sets of patterns offer a nice trade-off between accuracy, efficiency, and expressiveness. We have used them not only to reformulate our previous analysis for local views, but also to detect properties related to the transport of molecules and to the formation of double bonds between proteins. We have shown the scalability of our approach on two models in development.

For future works, we would like to use weakly relational domains [16] to infer the kind of non-local properties that are involved in the formation of macro-molecules. More precisely, we would like an analysis that computes the pairs of protein conformations that may occur simultaneously in a given molecular species.

Fig. 1. Rules for a model with relationships among the sites of proteins.

Fig. 2. Rules for a model with transport of molecules.

Fig. 3. Rules for a model with double binding: (a) first binding on sites x; (b) second binding on sites x; (c) first binding on sites y; (d) second binding on sites y; (e) sites x unbinding; (f) sites y unbinding.
(vii) for every triple $(\phi'_G, \phi'_H, G'_\cup)$ such that the following three conditions are satisfied: (a) $\phi'_G$ is an embedding from the site-graph $G$ to the site-graph $G'_\cup$, (b) $\phi'_H$ is an embedding from the site-graph $H$ to the site-graph $G'_\cup$, (c) $\phi'_G \psi_G = \phi'_H \psi_H$, there exists a unique homomorphism $h$ such that the constraints $\phi'_G = h \phi_G$ and $\phi'_H = h \phi_H$ are both satisfied as well.

Definition 4.1 [abstract satisfaction] Given a subset $Y \subseteq \mathcal{P}$ of patterns of interest, and a pattern $P$, we write $Y \models_{\mathcal{P}} P$ if and only if there exists a mixture $M \in \gamma_{\mathcal{P}}(Y)$ such that the pattern $P$ embeds into $M$.

Fig. 8. Orthogonal set of patterns for the protein B.

Fig. 9. Orthogonal set of patterns for the protein D.

Fig. 10. Orthogonal set of patterns for the protein F.

Fig. 11. Orthogonal set of patterns for the protein G.

Proposition 5.1 Let $Y$ be a subset of the set $\mathcal{P}$. The set $[\alpha_{\mathcal{P}} \circ F \circ \gamma_{\mathcal{P}}](Y)$ is exactly the set of the patterns $P \in \mathcal{P}$ such that at least one of the following conditions is satisfied:
(i) $P \in Y$,
(ii) there exist a rule $r \in R$, a gluing $(G_\cap, \psi_1, \psi_2, \phi_1, \phi_2, G_\cup)$ between the right hand side $\mathit{rhs}(r)$ of the rule $r$ and the pattern $P$, and a right refinement $(r, r', h_L, h_R) \in \mathit{right\_ref}(r, \phi_1)$ of the rule $r$ along the embedding $\phi_1$ such that $Y \models_{\mathcal{P}} \mathit{lhs}(r')$.

Fig. 13. Analysis result for the protein B.

Fig. 14. Analysis result for the protein D.

Fig. 15. Analysis result for the protein F.

Fig. 16. Analysis result for the protein G.

Fig. 17. Benchmarks: the analyses have been performed on a Dell Latitude E6430s with 8 GB of memory and a four core Intel Core i7-3540M CPU @ 3.00GHz processor.