RetroPath2.0: A retrosynthesis workflow for metabolic engineers.

Synthetic biology applied to industrial biotechnology is transforming the way we produce chemicals. However, despite advances in the scale and scope of metabolic engineering, the research and development process still remains costly. In order to expand the chemical repertoire for the production of next generation compounds, a major engineering biology effort is required in the development of novel design tools that target chemical diversity through rapid and predictable protocols. Addressing that goal involves retrosynthesis approaches that explore the chemical biosynthetic space. However, the complexity associated with the large combinatorial retrosynthesis design space has often been recognized as the main challenge hindering the approach. Here, we provide RetroPath2.0, an automated open source workflow for retrosynthesis based on generalized reaction rules that perform the retrosynthesis search from chassis to target through an efficient and well-controlled protocol. Its easiness of use and the versatility of its applications make this tool a valuable addition to the biological engineer bench desk. We show through several examples the application of the workflow to biotechnological relevant problems, including the identification of alternative biosynthetic routes through enzyme promiscuity or the development of biosensors. We demonstrate in that way the ability of the workflow to streamline retrosynthesis pathway design and its major role in reshaping the design, build, test and learn pipeline by driving the process toward the objective of optimizing bioproduction. The RetroPath2.0 workflow is built using tools developed by the bioinformatics and cheminformatics community, because it is open source we anticipate community contributions will likely expand further the features of the workflow.


Introduction
Despite the increasing number of small molecules that are bioproduced, the research and development process (R&D) is still costly and rather slow. For instance, the metabolic engineering of artemisinic acid is claimed to have taken more than 130 person-years and about 10 years to complete (Paddon et al., 2013;Keasling, 2014). Among the challenges that industrial biotechnology is facing to deliver sustainable solutions are 1) the reduction of R&D costs and 2) the bioproduction of a wider palette of compounds. To address these challenges, computational/experimental strategies where alternative metabolic pathways are first designed and assessed before being built and tested have been proposed (see reviews (Medema et al., 2012;Copeland et al., 2012;Hadadi and Hatzimanikatis, 2015;Lee and Kim, 2015)). While some computationally-driven strategies make use of known metabolic reactions albeit not necessarily in the same species (Rodrigo et al., 2008;Moriya et al., 2010) others allow to design pathways that encompass novel reactions not stored in metabolic databases. These latter tools make use of retrosynthesis algorithms (Marchant et al., 2008;Moriya et al., 2010;Henry et al., 2010;Carbonell et al., 2011bCarbonell et al., , 2014aYim et al., 2011;Liu et al., 2014;Campodonico et al., 2014;Hadadi et al., 2016a).
Retrosynthesis algorithms take as input a set of metabolites, for instance the metabolites in a growth medium or the metabolites of a chassis strain model, and the set of target compounds to bioproduce. Ideally the target compounds could be any molecule in the chemical space. The algorithms generate retrosynthesis networks linking the target compound(s) (the source) to the metabolites of the chassis strain (the sink) through reactions.
Such retrosynthesis networks should be further processed to map or extract information relevant for the biological application. For instance, some algorithms can be applied to enumerate pathways  and rank them based on several criteria including enzyme availability and performance, product and intermediate compound toxicities (Planson et al., 2012) or the theoretical yield of the desired compound (Campodonico et al., 2014;Carbonell et al., 2014b;Cho et al., 2010;Liu et al., 2014). Interestingly, retrosynthesis networks exploitation is not strictly limited to retrosynthesis. Applications have been proposed to predict biodegradation routes (Hou et al., 2004;Oh et al., 2007;Finley et al., 2009) in order to identify unknown compounds from the underground metabolism (Jeffryes et al., 2015), to predict the transitions of labelled atoms in metabolic networks (Arita, 2003;Hadadi et al., 2016b), and to design biosensing circuits for compounds for which no direct biosensors are known (Delépine et al., 2016). The main difference of the aforementioned applications lies in the definition of source and sink compounds sets; the current paper focuses on retrosynthesis but our solutions still stand for other applications requiring reaction network generation.
One issue users of retrosynthesis-based solutions are facing is that algorithms and underlying data have not been fully documented and released. In most cases, authors provided fine-tuned webservers (Campodonico et al., 2014;Carbonell et al., 2014b;Jeffryes et al., 2015;Hadadi et al., 2016a) often filled with pre-generated data that focuses on some exemplar cases. Based on this information, it is difficult for users to grasp methods' limitations, to improve them, or to exploit them for different uses. At a time when open-data principles gain more and more traction (Schofield et al., 2009;McNutt et al., 2016;Haug et al., 2017) we believe this lack of flexibility should be overcome.
In this spirit, we developed the RetroPath2.0 workflow on the KNIME analytics platform (Berthold et al., 2008) to answer the need for a modular and easy-to-use tool to predict reaction networks. Workflows have several advantages over scripting languages. A graphical user interface allows for rapid test and prototyping, even for users with little to no knowledge in programing. For instance, parallelization of tasks is inferred from workflow topology and does not need any special library or technical knowledge from the user. Once configured, workflows are readily deployable on all platforms where KNIME can be installed. KNIME workflows are popular in cheminformatics to prepare and analyse data, as shown by the number of extensions maintained by users in this field (Berthold et al., 2008;Warr, 2012). Thus, metabolic engineers benefit from a large panel of tools to analyse the chemical diversity and features of their data. As a matter of fact, RetroPath2.0 was developed using only community tools. We foresee it will make the workflow easier to modify and at the very least a good proof of concept of what can be done with workflows.
The current paper provides for the first time a simple workflow encompassing the main steps of the retrosynthesis process. We hereby review the main steps of retrosynthesis algorithms in order to demystify their use and shed light on the shortcomings of current tools (Marchant et al., 2008;Moriya et al., 2010;Henry et al., 2010;Yim et al., 2011;Liu et al., 2014;Carbonell et al., 2014a;Campodonico et al., 2014;Hadadi et al., 2016a). We then outline our proposed solution through several applications in metabolic engineering and biosensor engineering. Ret-roPath2.0 is available at myExperiment.org (https://www. myexperiment.org/workflows/4987.html) along a set of reaction rules and some classic metabolic engineering examples to test RetroPath2.0 features.

Encoding reactions as reaction rules
The first challenge that retrosynthesis algorithms have to address is linked to the way reactions are encoded. Most retrosynthesis algorithms are based on reaction rules, but other strategies exist to encode reactions (Kayala et al., 2011;Latino and Aires-de-Sousa, 2011). A reaction rule generally depicts the change in bonding patterns when transforming a set of substrates (reactants) into a set of products. For retrosynthesis applications, rules are reversed such that one computes the substrates from the products.
Several solutions have been proposed to code for reaction rules, namely Bond-Electron (BE) matrices (Dugundji and Ugi, 1973), reaction SMARTS (Daylight, 2017), RDM patterns (Oh et al., 2007), and reaction signatures (Carbonell et al., 2013). Examples of coding systems are illustrated in Fig. 1. We highlight below some key concepts to understanding reaction rule encoding in a retrosynthesis context.  Henry et al. (2010) and Yim et al., (2011). These are the only rules with EC number 2.6.1. In both cases the rules are represented by SMARTS strings. D. The reaction signature rule was computed using the MolSig package, d represents the signature diameter. See Carbonell et al. (2013) for definition and examples of signatures.

Enzymatic promiscuity
Reactions for retrosynthesis applications should be modelled with a controlled degree of generalization for their substrates and products. Indeed, reaction rules containing a full description of substrates and products chemical structures cannot be applied on new compounds. This is the case for classic metabolic models and database and their lack of generalization prohibits the generation of novel pathways. The use of generalized chemical transformations is required in order to be able to predict new metabolic transformations. Such predictions are necessary since reaction databases are not complete (Altman et al., 2013;Chang et al., 2015) and side enzymatic activities are often underestimated.
This lack of knowledge on alternative enzymatic activities is currently a critical limiting factor for metabolic engineering since it has been estimated that 37% of E. coli K12 enzymes have a promiscuous activity for other substrates structurally similar to their main known substrate (Nam et al., 2012). In order to be able to generate new metabolic transformations (and new compounds) one thus needs to use generalized reactions to model enzymatic promiscuity, i.e. rules that can be applied to different substrates, and eventually on compounds absent from the databases. For instance, BNICE (Henry et al., 2010;Jeffryes et al., 2015;Hadadi et al., 2016a) and SimPheny (Yim et al., 2011) use a collection of reaction rules that, as depicted in Fig. 1, can be applied to any ketones (including oxaloacetate) since their encoding is focused on the reaction centre.

Identification of the reaction centre
The simplest way of controlling a degree of abstraction for reaction substrates is to encode the reactions around its centre. This requires compiling the list of atoms that belong to the reaction centre, i.e. atoms that change their configuration when the reaction is applied (panel B in Fig. 2). Atoms changing configuration are those attached to bonds that are broken, formed, or are changing order, as well atoms for which charge and stereochemistry is changing when the reaction is taking place.
Reaction rules used in retrosynthesis generally require a solved Atom-Atom Mapping (AAM, see panel A in Fig. 2) between the atoms of the substrates and those of the products to identify the reaction centre of the reaction (Hou et al., 2003;Hatzimanikatis et al., 2005;Oh et al., 2007;Cho et al., 2010;Liu et al., 2014). The AAM problem is equivalent to the Maximum Common Substructure, or the subgraph isomorphism problem which turns out to be NP-hard (Chen et al., 2013). Avoiding the use of AAM to generate rules is nevertheless possible in some cases, as it was originally shown by a previous version of the RetroPath algorithm based on fingerprint subtraction (Carbonell et al., 2014a) (see Fig. 1).
Importantly, if encoding the reacting centre is necessary, it may not be sufficient to properly define a reaction catalysed by an enzyme since other atoms far from the reacting centre could be involved in the ligand binding as well. To palliate this problem, the definition of the reacting centre is extended to neighbour atoms, either systematically at a predefined bond-distance (diameter, panel C and D Fig. 2) or based on expert-knowledge.

Systematic rule generation
Reaction rules can be computed in principle by processing the set of reactions stored in metabolic databases. However there are some Rules are encoded as reaction SMARTS and characterized by their diameter (∞ purple, 4 blue or 0 green), that is the number of bonds around the reaction centre (atoms 6, 10 and 14, 19) defining the atoms kept in the rule. This allow for a controlled and flexible modelling of enzymatic promiscuity. Note that for the case of 2.6.1.1 the co-product is always the same (C: L-aspartate; D: 2oxoglutarate) but that is not always the case, depending on the connectivity of the atoms belonging to the reaction centre. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) difficulties associated with this task. Exhaustive rules generation is certainly another major challenge for retrosynthesis. We can distinguish two main philosophies to systematically encode enzymatic reactions. The first approach consists in encoding a small set of generalist rules guaranteed by a model to cover all possible classes of reactions. According to the Enzyme Commission (EC) nomenclature all reactions that belong to the same third level EC number should follow the same chemistry, while the fourth and last level is for disambiguation (International Union of Biochemistry and Molecular Biology Nomenclature Committee and Webb, E.C, 1992). Both SimPheny and BNICE use the third EC number level to guide their reaction encoding effort. SimPheny (Yim et al., 2011) has 50 manually curated reaction rules, and the number of rules of BNICE systems are of the same order; 86 for (Henry et al., 2010), 198 for (Jeffryes et al., 2015), 722 for (Hadadi et al., 2016a). This approach is well-suited for manual curation, since even if the number of reactions to annotate is rather small, it is supposed to be exhaustive in terms of the involved chemistry. Nonetheless, relying on EC numbers often requires adding exceptions since some reactions at the third level of EC numbers do not share any common substructure and thus cannot be expressed by the same rule. For instance, the carbon-halide lyases class (EC 4.5.1.*) is composed of five fourth level reactions which all remove a chlorine atom, but some reactions also remove a primary amine from a substrate and replace it either by a double bonded carbon, a hydrogen, an oxygen atom or a more complex functional group ( Supplementary Fig. S1). Their number of substrates and products also varies. Clearly, these reactions cannot be encoded using a single BE matrix, a reaction signature, or an intelligible reaction SMARTS. Another need for exceptions arises from the fact that many reactions have no EC number assigned by the Commission (International Union of Biochemistry and Molecular Biology Nomenclature Committee and Webb, E.C, 1992).
The second approach, which is more data-driven, is to automatically compute rules for all available metabolic reactions by selecting only the atoms belonging to a sphere of fixed diameter around the reaction centre. This is the approach adopted by the workflow proposed in this paper, RetroPath (Carbonell et al., 2011b(Carbonell et al., , 2014b, and others (Chen et al., 2013;Rahman et al., 2014;Sivakumar et al., 2016). Ideally, the diameter used should directly be linked to known promiscuity of an enzyme's sequence. In our experience, a diameter of 6-8 (see Supplementary Note 1 for a detailed discussion on diameter selection and promiscuity) is generally a good trade-off to cover known reactions' specificity with a reasonable amount of promiscuity predictions (see Section 4.1.3 for an evaluation of rules performance for promiscuity classification and (Carbonell et al., 2011a;Faulon et al., 2008)). Using the procedure outlined in the caption of Fig. 2, when applied to the MetaNetX database (Moretti et al., 2016) the number of rules returned is between 6900 and 19,000 depending on the parameters used to model enzymatic promiscuity (diameter) for the 31,527 reactions stored in MetaNetX (MNXR identifiers, v.2.0). Interestingly, not only multiple generated rules can belong to the same EC class, but also a same rule can correspond to several EC classes. For instance, at diameter 4, three EC numbers (2.6.1.1, 2.6.1.17, 2.6.1.67) from three distinct reactions (resp. MNXR32641, MNXR32641, MNXR31792) are associated to the same rule depicted in Fig. 1D (promiscuity on oxaloacetate, MNXM42).

Cosubstrates, cofactors and coproducts
Another challenge for retrosynthesis algorithms is the need to handle reactions processing multiple substrates and/or multiple products. Dealing with multi-substrate reactions requires more computational resources in order to model enzymatic promiscuity for each combination of promiscuous substrates (Fig. 2).
For these purposes, cosubstrates and coproducts that are currency cofactors (such as water, CO 2 , ATP, NADP, etc.) can be ignored from the rules under the assumptions that they are available in the cell and that there is no gain for retrosynthesis analysis in modelling promiscuity on them. However, information about cofactors participating in reactions should not be discarded since they could be used at a later stage to sort pathways by their efficiency in terms of cofactor exchange and the burden they impose on central metabolism.
Nonetheless, even if we ignore currency metabolites in the rules, around a third of metabolic reactions still remains multimoleculars (see Supplementary Note 2). Our practical solution is to model enzymatic promiscuity for only one substrate at a time, meaning that for any multi-substrate reaction "A + B → C + D", alternatives substrates A′ and B′ are never tested together to limit the combinatorial complexity. RetroPath2.0 follows this solution as we encode one rule per reference substrate (for components "A → C + D" and "B → C + D") as shown in Fig. 2. Others embrace combinatorial complexity ( Fig. 1B and D) or simply ignore all cosubstrates (Fig. 1C).

Building (retrosynthesis) reaction network
In all algorithms listed in Table 1, retrosynthesis maps are constructed by applying reaction rules in an iterative fashion starting from a source set of compounds until the molecules in a sink set of compounds are found in the map. In the context of metabolic engineering, if the rules are applied in a forward manner, the source set is composed of the native metabolites of the chassis strain and the sink set are the molecule we wish to produce. If the rules are applied in a reverse manner then the source set are the molecules to be produced and the sink set are the metabolites of the chassis. One bottleneck that all algorithms face is computation complexity due to the combinatorial explosion of the number of reactions predicted from the rules. This is true regardless of whether the reactions are applied in a forward or reverse manner. As an example, let us assume we wish to perform retrosynthesis for some FDA approved drugs in E. coli. In the reaction list we have at our disposal there is one for reversed hydro-lyases (i.e. reversed 4.2.1). According to (Henry et al., 2010) the rule for that reversed re- , where R5 can be C, H, O, and S and all other Rs can be any atoms. Assuming R1C(=O)C(R2)=C(R3)R4 is the main substrate (our drug target) and O-R5 the cosubstrate, 68 FDA approved drugs from DrugBank contain the first substructure. If we restrict the cosubstrate to be in the E. coli model iJO1366 then 653 metabolites out of 810 compounds in the model contain the second substructure, while 50,810 compounds from MetaNetX will pass the substructure test. Taking Vitamin C as an example of a DrugBank compound that passes the substructure filter, one finds 1883 unique products when applying the reversed rule 4.2.1 to Vitamin C and E. coli metabolites and 343,177 products when the cosubstrate is in MetaNetX. There are more products than substrates because for some substrates the reversed rule 4.2.1 applies to more than one location.
As already mentioned, for a given retrosynthesis target one needs to apply all rules to the target, all rules to the products obtained by application of the reversed reactions to the target, and so on until a predefined stop condition occurs (often the number of iterations). Clearly, if reaction rules generate more than 1000 products even with 50 rules the problem starts to be challenging -if not impossible-to manage computationally after 2 or 3 iteration steps.
Strategies are needed in order to cope with that complexity. RetroPath proposes a solution where reactions are scored according to their ability to retrieve enzyme sequences catalysing substrate to product transformations. Reactions below a predefined score are removed from the retrosynthesis map. For any given reaction the sequence scores are computed by machine learning using a technique that we developed earlier. The model is trained on all known pair "enzyme sequence" x "(substrate, product)" using Support Vector Machines (Faulon et al., 2008) or Gaussian Processes (Mellor et al., 2016). GEM-Path (Campodonico et al., 2014) proposes another strategy where for each reaction the substrates are accepted if they are similar enough to the substrates of the reference reactions.
We detail in the next sections a new implementation of RetroPath to predict reaction networks and perform retrosynthesis among other applications. RetroPath2.0 addresses the challenges listed above with a special attention to remaining easy to use and modifiable by end users, unlike tools developed so far. In that sense, both the encoding of reactions into generalized rules and the actual use of those rules to predict new reactions depend strictly on resources developed by the community.

Method
The workflow proposed in this section can a priori be used to all systems presented in Table 1 to construct retrosynthesis maps as long as reaction rules can be coded by reaction SMARTS. As examples, we provide such set of rules extracted from the SimPheny, BNICE and RetroPath systems.
Our computational methods make use of in-house algorithms (Carbonell et al., 2014a), RDKit routines (Landrum, 2016) and KNIME nodes (Berthold et al., 2008). They have been implemented in the form of a KNIME workflow -called RetroPath2.0- (Fig. 3) that we provide in the Supplementary materials, in addition of sets of rules, examples, and useful data files. Updates will be published on myExperiment.org (https://www.myexperiment.org/workflows/4987.html).

Reaction rules
RetroPath2.0 uses reaction SMARTS to encode reactions. It is a SMIRK-like reaction rules (Daylight, 2017) format defined by RDKit and is mostly compatible with other tools (see Fig. 1).

Collected reaction rules
Rules for SimPheny and BNICE were extracted from Yim et al. (2011) and Henry et al., (2010) respectively, and manually entered by using Chemaxon Marvin Sketch software products (15.5.18, 2015, http://www.chemaxon.com). For each rule atom mapping was calculated by Marvin Sketch and the resulting rules were stored in SMARTS format as in Fig. 2.

Generated reaction rules
We used MetaNetX version 2.0 (Moretti et al., 2016) as the reference database for metabolic reactions that we encoded in rules. MetaNetX is a meta-database that compiles into a single reference namespace both reactions and metabolites extracted from main metabolic databases such as KEGG, Metacyc, Rhea or Reactome. Reactions can contain many substrates and many products. We performed an Atom-Atom Mapping (AAM) using the tool developed by (Rahman et al., 2016) on all Me-taNetX reactions ( Fig. 2A). We filtered out transports reactions and those involving compounds with incomplete structures (class of compounds, R-groups, etc.). Stereochemistry was removed.
Multiple substrates reactions were decomposed into components (panel C and D in Fig. 2). There are as many components as there are substrates and each component gives the transformation between one substrate and the products. Each product must contain at least one atom from the substrate according to the AAM. This strategy enforces that only one substrate can differ at a time from the substrates of the reference reaction when applying the rule (see Section 2.1.1 enzymatic promiscuity modelling).
Next step consisted in computing reactions rules as reaction SMARTS for each component. We did it for diameters 2 to 16 around the reaction centre (panels C and D in Fig. 2) by removing from the Fig. 3. RetroPath2.0 KNIME workflow. A. Main panel view (left) and input configuration window (right) that allow the user to set up parameters. B. Inner view of the "Core" node where the computation takes place. The "Source, Sink…" and "Rules" nodes parse the source, sink and rules input files provided by the user and sanitize data so that it can be processed by downstream nodes. The outer loop ("Source" loop) iterates over each source compounds, while the inner loop ("Length" loop) allows to iterate the process up to a maximum number of steps predefined by the user. The nodes (i) "FIRE", (ii) "PARSE", (iii) "UPDATE SOURCE…" and (iv) "BUILD" are sequentially executed at each inner iteration. Respectively, they (i) apply all the rules on source compounds, (ii) parse and sanitize new products, (iii) update the lists of source and sink compounds for the next iteration and (iv) merge results that will be written by the node "Write global results". Once the maximum number of steps is reached (or no new product is found), the "Compute scope" node identify the scope linking each source to the sink compounds, then these results are written by the node "Write per source results". Only the main nodes involved in the process are shown. reaction components all atoms that were not in the spheres around the reaction centre atoms.
We extracted more than 24,000 reaction components from MetaNetX reactions, each one of those leading to a rule at each diameter (from 2 to 16).
We provide in Supplementary a subset of 14,300 rules for E. coli metabolism, both in direct and reversed direction. The rules were selected based on the MetaNetX binding to external databases and the iJO1366 whole-cell E. coli metabolic model (Orth et al., 2011).
3.2. Building (retrosynthesis) reaction networks between two pools of compounds using the RetroPath2.0 workflow The RetroPath2.0 workflow essentially follows an algorithm proposed by some of us (Carbonell et al., 2011b). After removing all source compounds already in the sink set, the workflow applies the rules to each of the compounds of the source set. For each compound, the products are computed using the RDKit KNIME nodes (Landrum, 2016). Products are standardised and duplicates are merged. All pairs substrate-product are added to the growing network along with the reaction rules linking them.
In the next iteration, the set of products becomes the new source set. However, before iterating, the workflow removes from the new source set all compounds that belong to the sink (as these are already solutions and there is no need to iterate) and the workflow adds the product set to the sink in order to avoid applying reactions on the same products during subsequent iterations. Consequently, the workflow computes only the minimal routes between source and sink, i.e. routes in which all reactions are essential for their viability, and thus minimizes the number of enzymes to be added to a chassis strain when implementing the pathways. This feature can be ignored by not specifying a sink for the first iteration.
The RetroPath2.0 workflow iterates until a predefined number of iterations is reached or until the source set is empty. The final produced graph is composed of the list of links between substrates and products annotated with their corresponding reaction rule. Products belonging to the sink are annotated as such.
Note that the iterative process can reveal itself to be quite computationally demanding. To tackle this issue, RetroPath2.0 has a feature to bias the reaction space exploration toward compounds generated by trusted rules, using a rule-wise penalty score. If too many compounds are generated and need to be handled at once, only a predefined number of compounds with the lowest penalties according to their generating rules are kept in the new-source of the following iteration. Of course, both the definition of the penalty and maximum number of compounds to keep are critical and fall within the responsibility of the user. As described next, the rules we provide are scored to optimize in vivo pathway feasibility by penalizing rules associated to enzymatic reactions with inconsistent sequence annotation.

Score rules by enzyme sequence consistency
Predicted reactions in the final graph generated by the RetroPath2.0 retrosynthesis workflow need to be associated with enzyme sequences in the final engineering of the pathways. The selection of such sequences should look for a trade-off between the specificity of the reaction rule and the information available in enzyme databases for the reaction through the EC classification. Whereas the EC classification has traditionally provided a hierarchical numerical classification of enzyme-catalysed reactions to progressively describe reactions in finer detail, RetroPath2.0 introduces a similar hierarchical classification that is controlled by the diameter used in rule generation. In some cases the diameter of the reaction rule found by the RetroPath2.0 workflow might be high, i.e. highly specific to that reaction. However, it often occurs that there is no annotated enzyme sequence for the rule. In order to find some candidate sequences, we look into reactions that are close according to the EC hierarchy for each EC class containing at least one instance of the rule at given diameter. Traversing both rules diameter hierarchy and the underlying EC classes allows the selection of plausible sequence candidates for each reaction rule.
We compiled the set of Uniprot sequence identifiers annotated for reactions by looking at the cross-link annotations in MetaNetX for Rhea and MetaCyc databases. In total 208,980 sequences from 5388 organisms were associated to 7793 reactions. At a given diameter of the rule, we iteratively assigned sequences to rules. First, reactions with annotated sequences were collected for each generated rule. Since a rule can represent one or more reactions at a given diameter, sequences coming from different reactions sharing the same rule were aggregated into a single set for that rule. These direct annotations only provided a partial coverage for the total rules in the database. For instance, at diameter d = 8, there were 7898 orphan rules, i.e. rules that were generated from reactions lacking sequence annotation (Supplementary Table 1). Similarly, there were 6280 orphan reactions at diameter d = 8. In order to increase the coverage, we considered the EC class of the reaction when such information was available. Sequences associated with reactions sharing strictly the same EC class were combined together. Adding together such annotations for the same EC class fixed issues related to partial annotations for the less common reactions. In that way, the number of orphan rules was significantly reduced to 1719, which is approximately a 13% of the total rules. Similar ratios were observed for reactions.
For the orphan rules having no sequence annotation after considering the EC class of the reactions, we followed the strategy of reducing the specificity of the EC class by reducing the number of digits. In other words, if a rule had no annotation based on the EC class at 4 digits, we looked at reactions that shared same EC class at 3 digits with one reaction associated with the rule and so on until we found sequence annotations. Notably, a sharp decrease on the number of orphan rules already occurred at the level of three digits of the EC class. The remaining orphan rules, less than 1%, was eventually annotated once we reduced the specificity from 3 digits down to 1 digit in the EC class.
We should emphasize that in the procedure described below, sequence annotations that merged multiple EC classes sharing same initial digits were only used for those cases where no sequence information was available at higher EC class levels. This annotation from higher to lower specificity in the set of sequences associated with the rules depending on known sequences allowed us to score the rules. A rule that has associated sequences with low diversity should in general correspond to cases where the sequence information is highly specific to that rule. As the diversity of sequences increases the specificity of those sequences to their associated rules becomes lower. We evaluated such degree of specificity by considering the degree of clustering of the sequences associated with the rules. Clustering of the sequences was performed by using Cd-hit (Li and Godzik, 2006). According with this algorithm, our database of 208,980 amino-acid sequences was clustered into 22,221 clusters for a similarity threshold of 0.5. We used a penalty score for the rules based on the number of sequence clusters n rule contained in the sequences selected for a given rule: where the logarithm is applied for regularization. A penalty score of 0 implies high specificity, as this means that all sequences belong to a single cluster, while high penalty scores imply multiple clusters and therefore low specificity in the sequence annotation.

Enumerating pathways between two pools of compounds
The lists of pathways linking (i) a pool of source compounds to (ii) a pool of sink compounds are computed running an algorithm we developed earlier (Carbonell et al., 2014a). This algorithm consists of the following steps for a given source compound. (1) Compute the scope, a subset of predicted reactions between the sink compounds and the set of source compounds. The scope represents the set of compounds and reactions that are involved in at least one pathway. It is computed in a two steps search. First the forward step starting from source compounds finds all reachable compounds that are producible through reactions. Secondly the backward step starting from the sink compound adds to the scope all reactions that can be involved in at least one produciblepathway.
(2) Build the stoichiometric matrix. The stoichiometric matrix describes the directed subnetwork involving the set of compounds and reactions identified at the scope step, starting from the source compounds. (3) Enumerate elementary flux modes. An elementary mode corresponds to a minimal unique set of reactions that (i) verified the stoichiometric constraints of the network and (ii) is able to carry non zero-fluxes at the system's steady-state (Schuster et al., 2000). In order to efficiently compute elementary modes, stoichiometric matrix dimension is generally reduced through lossless compression. Only enumerated flux modes linking source compounds to the sink compound are kept in order to form the final list of pathways. These three steps are performed iteratively for each source compound.
RetroPath2.0 computes the scope for each queried compound. It can be visualized and explored to retrieve the pathways thanks to ScopeViewer, a humble web-application that we provide in Supplementary. Note that the provided workflow does not explicitly extract the pathways and does not rank them. Yet, we provide at https://github.com/brsynth/rp2paths a separate utility program "RP2paths" allowing one to enumerate pathways from the results generated by RetroPath2.0.

Results
We validated our set of rules with RetroPath2.0 by checking that they were able to reproduce the known metabolic space, and that they could be used to perform reaction classification. The capability of RetroPath2.0 to perform retrosynthesis was confronted to in vivo experiments by counting the number of bioproduction pathways found for targets extracted from a database of metabolic engineering successes. We also emphasized the versatile usage of RetroPath2.0 by an original application to design biosensors (Supplementary Note 3).

Rules validation
The quality of the output of the workflow depends largely on feeding it with the proper set of reaction rules. Some authors (Henry et al., 2010;Yim et al., 2011) have published sets of rules that already constitute an initial test bed. We collected those in addition of a set of SMARTS rules that we compiled for all reactions of the last E. coli whole-cell model (Orth et al., 2011) based on MetaNetX cross-references. Those rules are available in Supplementary. All rules were checked to ensure they could be used with the workflow and yield at least one product.

Coverage of known metabolic space
In order to check the potency of the rules, i.e. that they could indeed be used to predict reactions, we tried to retrieve all reference reactions of MetaNetX from the rules. We compared three dataset of monosubstrate rules according to their origin: SimPheny (Yim et al., 2011), BNICE (Henry et al., 2010) and RetroPath2.0. To make a fair comparison we selected from all MetaNetX reactions a subset of 13,000 reactions having an associated EC number and a structure for all its compounds (SimPheny and BNICE rules are based on EC numbers). We extracted from those 6000 substrates and 7000 products (MetaNetX identifiers) excluding cofactors. For each rule dataset, all rules were applied on the set of substrates using the workflow with default parameters. We counted the number of products that could be regenerated and the number of generated compounds that were referenced in Me-taNetX.
Remarkably given the number of rules considered, 34% of MetaNetX products were recovered by SimPheny rules (50), and 41% by BNICE rules (86). They respectively generated 75,400 and 59,000 compounds, among which 5% and 7% could be found in MetaNetX and are thus connected to a biological database. Since RetroPath2.0 rules were generated from MetaNetX data we expected a better coverage over the products. This was indeed the case with 96% recovered products from MetaNetX's reactions. The few missed products originated from reactions that could not be encoded in rules due to atom-atom mapping issues. Additionally, 63% of the 17,500 compounds generated by RetroPath2.0 are new to MetaNetX, which highlights the capability of our rule dataset to generate a reasonable amount of new compounds. The fact that RetroPath2.0 rules generates less compounds than the other tested sets of rules is explained by the differences in term of diameter used. RetroPath2.0 uses a flexible diameter, which by default ranges from 16 to 2, decreasing if no rule can be used on a substrate at higher diameters. This has for effect to prioritize more conservative results (higher diameter) while ensuring that broader promiscuity are tested as a last resort (lower diameter). Overall, product coverage shows us that RetroPath2.0 rules are able to reproduce most of MetaNetX products, hence most of what is known of the metabolic space.

RetroPath2.0 rules for reaction classification
We evaluated the ability of our rules to perform automated reaction classification. To that end, reactions in the database that contained EC class annotations were grouped into their corresponding EC class at the third level. We then computed the similarity between reactions based on the signature content of their rules. For a given diameter d, each rule was decomposed into its elementary signatures (Carbonell et al., 2014a) and similarity between two given reactions R 1 and R 2 was computed by means of the Jaccard similarity coefficient T R R ( , ) d 1 2 applied to the two reaction rules: The previous expression ranges between 0 (minimum similarity) and 1 (maximum similarity) and has been often applied to compute similarity between compounds or even reaction that are described by binary fingerprints (EC-BLAST (Moriya et al., 2010;Campodonico et al., 2014;Rahman et al., 2014). The advantage and main difference of using rules with a selectable diameter is that we can compute the Jaccard similarity coefficient in function of the diameter d. That generates a sequence of monotonically decreasing similarities starting from 0 up to the maximum diameter of the reactants. Similarity of two reactions at diameter 0 contains the basic information about common patterns of bonds that were broken or formed in the two reactions. As we extend similarity to higher diameters, information becomes more specific to the substrates and products involved in each reaction.
In order to capture efficiently this feature of diameter dependence for Jaccard similarities between rules, we defined a global similarity parameter between reactions S(R 1 , R 2 ) extended to a diameter range [0, d] as an exponentially increasing weighted sum of the Jaccard similarity coefficients: where a is a regularization parameter. For each reaction in the database, we computed its corresponding rule and similarities based on a diameter range from 0 to 8. In total, rules were computed for 13,782 reactions contained in the database. We used a = 2 as regularization parameter.
We then tested the discriminant ability of using such reaction global similarity measure for reaction classification. Our tests were performed using the R package ROCR. We created a positive and negative set for each EC class. The positive set was formed by the set of reactions annotated for this EC class. A balanced training set was then built by randomly selecting from the negative set. For each EC class containing at least 10 data points, as well as for the total set of balanced training set we computed the area under the ROC curve (AUC), resulting in an overall AUC of 0.884 for diameter d = 8 (Supplementary Fig. S4). Such performance values are slightly higher than the ones obtained by EC-BLAST (Rahman et al., 2014) by using fingerprint-based similarities, showing the ability of the rules as reaction classifiers.

Score vs. specificity
The ability of substrate generalization of SMARTS rules can potentially be used to assess enzyme specificity. Enzyme specificity is an important factor that needs to be considered for metabolic pathway engineering. Moreover, several studies have shown that enzymes that can catalyse multiple reactions or can process multiple substrates have more evolvability capabilities than specific enzymes (Khersonsky and Tawfik, 2010;Nam et al., 2012;Orth and Palsson, 2012;Guzmán et al., 2015). Such property can be approached through our rules as they provide a means for representing chemical transformations for generalized substrates. The level of generalization of reactions and ultimately of their associated enzyme sequences could be therefore quantified using our rules. As described in Methods, one can define a specificity score by assessing the level of generalization of both the reactions and sequences having such reactions at a given rule diameter. The algorithm traverses both the reaction and sequence space in order to score reaction specificity and more specific rules get lower scores.
To evaluate the ability of the score to represent enzyme specificity, we have analysed a reference set of enzymes in E. coli that have been classified as either specific or generalist, i.e. if they can catalyse one or multiple reactions (Nam et al., 2012). For each gene, we took their associated reactions in the EcoCyc database (Keseler et al., 2013). Each reaction was mapped into their associated rule at several diameters d. The resulting scores for each gene were then aggregated. We mapped in total 787 E. coli genes, with 602 specific vs. 185 generalist enzymes, respectively.
Notably, the scores computed in that way, as shown in Supplementary Fig. S5, displayed the ability to differentiate between these two groups of enzymes, (t = − 6.5144, p-value of 2.3e−10 for a Welch's two sample t-test), with specific enzymes receiving lower ranking. We should note that the classification between specific vs. nonspecific enzymes depends on the actual knowledge and degree of detail in the description of the reactions in the reference organism and therefore the list of generalist enzymes should be updated as long as new activities are discovered (Guzmán et al., 2015). For instance, we observed a clear outlier in the set of specific enzymes that received a high score based on rules and therefore we should expect wider specificity. This was the case of gene phoA, b0383, alkaline phosphatase EC 3.1.3.1. It turned out that this enzyme has been reported to have wide specificity (Yang and Metcalf, 2004) in agreement with the high score.

Workflow validation and applications
We tested the reaction network prediction features of RetroPath2.0 workflow with two applications. The typical prediction of bioproduction pathways (see below), and the prediction of small biosensing metabolic circuits for biomarkers (see Supplementary Note 3).

Coverage of bioproduction pathways
The Learning Assisted Strain EngineeRing (LASER) database is a repository for metabolic engineering strain designs (Winkler et al., 2015). It stores more than 600 successful metabolic engineering designs (Winkler et al., 2016) that have been manually curated from the literature. Those examples are particularly appealing for testing retrosynthesis features since they include an ideal dataset of authentic positive examples of bioproduction pathways, sometimes involving heterologous enzymes. We extracted all compounds targeted for production described in the LASER database (release f6ce080a8993) and used them to assess the ability of RetroPath2.0 to find retrosynthesis pathways for real-life applications when used with all the rules from MetaNetX.
The structures of the target compounds were inferred from their name by querying PubChem and ChemSpider. 160 compounds targeted for bioproduction were extracted from LASER. To complete further this dataset, we extracted 68 compounds (MBE dataset) published in Metabolic Engineering in 2016 (volumes 33-38), a period not covered by LASER. These two datasets contained 203 distinct compounds in total once merged together based on their structure (standard InChI). Furthermore, we removed E. coli endogenous compounds that were used as our "sink". Finally, 146 distinct compounds were collected to serve as "source" compounds.
Compounds from E. coli were extracted from iJO1366 whole-cell model (Orth et al., 2011) and MetaNetX cross-references. We ignored compounds that belong to so-called "blocked pathways" which are by definition impossible to produce or consume at steady-state in a metabolic model. Such compounds do not constitute a proper source (or sink) for retrosynthesis applications because reactions explaining compound availability in the chassis could be missing. We performed a flux variability analysis to identify them. Overall, we collected 962 MetaNetX identifiers of compounds belonging to E. coli that we provide in Supplementary along with their structure (InChI).
All results were generated with a maximum of five retrosynthesis iterations and a timeout of three hours per target on a recent desktop computer. This puts us in realistic operational conditions for users that might have access to modest computational resources. Given those constraints, we successfully found at least one pathway for 81% of the targets (119/146), i.e. a set of reactions allowing the production of the target compound exclusively from E. coli endogenous metabolites. Interestingly, we found more than one pathway in most of the cases (104/119).
One of such compounds for which several pathways was found is styrene. Styrene is a building block used in the fabrication of plastics (Isikgor and Becer, 2015). LASER references one pathway for the bioproduction of styrene from phenylalanine with heterologous enzymes in E. coli (McKenna and Nielsen, 2011;McKenna et al., 2015) and in S.cerevisiae (McKenna et al., 2014). RetroPath2.0 found this pathway (Fig. 4, in red) along with five alternative one from E. coli endogenous compounds: 3-phenylpropionic acid, phenylacetaldehyde, and phenylpyruvic acid (Fig. 4, resp. F, G, and H).
Another non-natural example for which several pathways were found is terephthalic acid (TPA). TPA is a non-natural commodity chemical widely-used for its ability to form synthetic fibres, and ultimately in the fabrication of polyesters such as PET. TPA is traditionally produced from p-xylene by synthetic chemistry processes (Sheehan, 2000). The p-xylene can eventually come from lignocellulosic biomass, making the TPA a bio-based compound in such cases (Isikgor and Becer, 2015). Interestingly, two enzymatic bioproduction pathways have been reported for TPA, and they follow the same chemical transformations as the ones from synthetic chemistry (Sheehan, 2000); one from p-xylene (Bramucci et al., 2001) in Burkholderia genus, and another from p-toluic acid in Comamonas testosterone (Wang et al., 2006). RetroPath2.0 retrieved those routes and proposed alternative shorter paths from endogenous E. coli compounds such as phenylalanine, phenylpyruvic acid, and 3-phenylpropionic acid (Fig. 5, resp. K, P, and M). To the best of our knowledge, those pathways have never been implemented in vivo.
Those results highlight the interest of RetroPath2.0 for retrosynthesis applications. As an additional example, see also the pathways predicted toward ethylene glycol in Supplementary Fig. S3. Retro-Path2.0 is able to reproduce validated pathways and to propose new ones, both for natural and non-natural compounds. All results are provided in Supplementary.

Discussion
The RetroPath2.0 workflow is a versatile reaction network tool, built to be modular enough to answer most metabolic engineering needs. RetroPath2.0 takes as input a first set of compounds (the source), a second set of compounds (the sink) and a set of reaction rules (see Fig. 3). The workflow produces a network linking the source set to the sink set, where each link in the network correspond to a reaction rule. The RetroPath2.0 workflow runs under the KNIME analytics platform and is available in Supplementary material and at myExperiment.org The choice of source, sink and rule sets depends on the application. For instance, if one wishes to find all possible synthesis routes that can be engineered for a target compound, then the source set will be the target, the sink will be the set of metabolites of the chassis strain, and the rules will be the reversed form of all known metabolic reactions (cf. 4.2.1). If one is interested in finding pathways to be engineered to degrade a given xenobiotic, the source set will be the xenobiotic, the sink set can be composed of those metabolites in the central metabolism of a chassis strain and the rule set could comprise all known catabolic reactions. In the same vein, one can find sensing-enabling pathways with the set of known detectable compounds as sink, the set of target compounds to detect as source, and by using the forward rules (see Supplementary Note 3 for the detection of biomarkers). Finally if one wishes to know all possible compounds that can be produced with a chassis strain when adding heterologous enzymes, the source set is composed of the metabolites of the chassis strain, the sink set can be either empty or a set of compounds in a vendor catalogue, and the rule set should cover all reactions that could occur in the chassis strain, including heterologous enzymatic ones. Moreover, any other applications where the problem can be framed into source, sink and rule sets can be processed by the workflow including problems where compounds are not metabolites and reactions are not metabolic reactions.
The most critical feature of a reaction network prediction system is certainly how the reactions are encoded and from where this knowledge was extracted. In our case, we choose to adopt a reaction encoding based on SMARTS, a widely accepted compound query language (Daylight, 2017) that was already used successfully in such context (Hadadi and Hatzimanikatis, 2015). Unlike most rule-based reaction prediction systems, RetroPath2.0 rules are not built around the Enzyme Commission nomenclature, but rather from an automatic translation of enzymatic reactions extracted from databases, which we believe offers a refined view of enzyme's capabilities.
We showed that our rules were able to classify reactions and that our set of rules extracted from MetaNetX had a good coverage over the known reactome. A good part of the reactions that were not covered were actually reactions involving compound classes (e.g. "an alcohol"), which were removed during the rule generation steps. This type of generalized reactions were, in turn, represented in our set through our unique way of encoding reactions as generalized rules. One substantial improvement could probably be met by constraining the atom-atom mapping and reaction centre identification steps based on the exploitation of additional knowledge on the reaction and the associated enzyme. For instance by using the known alternative substrates associated to a single enzyme sequence or the EC number assignation.
Evaluating the coverage of a reaction database is interesting in order to assert the coverage of the known reactome by a given set of rules, but it cannot be used to assert the efficiency of a retrosynthesis tool. Indeed, the coverage of a reaction database depends mainly on the database from which the rules were inferred and how exhaustive the cross-links are between those two. Ideally, we would desire a set of rules being able to recover all known biochemical reactions. It means that anything less than 100% coverage evidences that the set of rules is incomplete and that more data could have been aggregated. Note that in this work we focused our efforts on MetaNetX for the sake of simplicity but it is clear that more data can be imported from other databases such as BRENDA (Chang et al., 2015).
To the author's opinion, a better indicator of retrosynthesis tools efficiency should be found in the coverage of known pathways realized in a metabolic engineering context. This is precisely what we did using the LASER database as a reference for examples of successfully engineered metabolic pathways. In that way we provided a comprehensive overview of the capabilities of our tool in order to identify metabolic engineering solutions to bioproduction for well-studied cases. The main source of misprediction that we observed in our analysis came from cases in which additional compounds absent from E. coli metabolism were needed to perform the synthesis. Indeed, we performed all computations within five iterations from E. coli, with target compounds that were not necessarily produced in this chassis nor at five enzymatic steps; moreover, some substrates could be supplemented in the media of the chassis organism. For instance, the synthesis of morphine is described for Saccharomyces cerevisiae in Thodey et al. (2014) by two pathways at three and four steps from thebaine. Thebaine is not naturally present in E. coli metabolism thus absent from the sink we used. Consequently, this example has no scope at five steps and was counted as mispredicted. Once thebaine is supplemented in the sink, Retro-Path2.0 can generate a scope with both pathways. Note that thebaine was already predicted before being added to the sink, and that doing so only allowed RetroPath2.0 to use this compound as a valid starting Importantly, not all predicted pathways can be readily implemented in E. coli. Indeed, translation of in silico models into in vivo experiments require much more constraints to be satisfied, some of those being hardly predictable. To name but a few, enzyme sequence availability, chassis ability to fold the enzymes, kinetics, intermediate compounds toxicity, and overall pathway induced stress on the cell should all be checked before going any further. In this context, RetroPath2.0 can be seen as a base on which everyone is invited to build new features in order to further improve its metabolic space exploration abilities.
Exploiting chemical diversity in order to gain access to the large catalogue of natural and non-natural chemical resources is arguably one of the most important goals for biotechnology applications. By extending metabolic capabilities of enzymes, applications in metabolic engineering, biosensors and synthetic circuits can be greatly enlarged and diversified. To that end, RetroPath2.0 brings to the community a flexible and scalable open source platform with unique metabolic design capabilities. For the first time, we allow the systematic application of a full set of validated and standardized reaction rules that can be expressed with a selectable level of specificity. Such representation, which parallels the versatility of enzyme promiscuity, allows an indepth exploration of latent abilities of natural enzymes.
The excellent coverage of the workflow along with its proved ability for recovering both known pathways and putative alternative candidate pathways show its power as an engineering tool. For that reason, we have no doubt that the tool will be received as a valuable addition to the toolbox for engineering biology. Moreover, community contributions to the workflow will likely expand further the features of the tool, even beyond metabolic design. In summary, we believe that the ability of RetroPath2.0 to rationalize and standardize design steps of biological engineering that have been traditionally performed manually by trial and error, constitutes a major contribution towards the development of automated workflows across the whole design, build, test and learn cycle.

Funding
This work was supported by the French National Research Agency Fig. 5. Enumerated pathways for the production of the non-natural compound terephthalic acid (TPA, compound A) from E. coli. Each pathway is depicted by a distinct colour. Pathway referenced in Bramucci et al., (2001)

is in teal blue (T-S-Q-O-H-C-A).
Compounds are represented by their structures, reactions by their EC numbers. TPA and sink compounds are surrounded by a solid line, intermediates by a dashed line. Reactions with unknown EC number according to MetaNetX are referenced by their MetaNetX ID. A: terephthalic acid; B: benzoic acid; C: 4-formylbenzoic acid; D: benzaldehyde; E: benzoyl-CoA; F: phenylacrylic acid; G: terephthaldehyde; H: p-hydroxymethyl benzoic acid; I: 3-phenylserine; J: 2-succinylbenzoyl-CoA; K: phenylalanin; L: 3-phenyllactic acid; M: 3-phenylpropionic acid; N: 4-(hydroxymethyl) benzaldehyde; O: p-toluic acid; P: phenylpyruvic acid; Q: p-tolualdehyde; R: 1,4-benzenedimethanol; S: 4-methylbenzyl alcohol; T: p-xylene. Cofactors have been removed for clarity; the whole scope is available in Supplementary. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)