A Bayesian method for concurrently designing molecules and synthetic reaction networks

ABSTRACT In the last few years, de novo molecular design using machine learning has made great technical progress but there are still challenges to be overcome in their practical applications. This is mostly owing to the cost and technical difficulty of synthesizing such computationally designed molecules. To overcome such barriers, various methods for synthetic route design using deep neural networks have been studied intensively in recent years. However, little progress has been made in designing molecules and their synthetic routes simultaneously. Here, we formulate the problem of simultaneously designing molecules with the desired set of properties and their synthetic routes within the framework of Bayesian inference. The design variables consist of a set of reactants in a reaction network and its network topology. The design space is extremely large because it consists of all combinations of purchasable reactants, often in the order of millions or more. In addition, the designed reaction networks can adopt any topology beyond simple multistep linear reaction routes. To solve this hard combinatorial problem, we present a powerful sequential Monte Carlo algorithm that recursively designs a synthetic reaction network by sequentially building up single-step reactions. In a case study of designing drug-like molecules based on commercially available compounds, compared with heuristic combinatorial search methods, the proposed method showed overwhelming performance in terms of computational efficiency, coverage, and novelty with respect to existing compounds. We also provide the Python library ‘Seq-Stack-Reaction’ with its illustrative example of designing highly viscous lubricant molecules. GRAPHICAL ABSTRACT


Introduction
In recent years, machine learning-based molecular design has made great technological advances and has achieved remarkable outcomes in drug discovery and materials science [1][2][3][4][5][6][7][8].The design objective here is to identify the chemical structure of a new molecule with a given set of desired properties.To do so, a machine learning model that forwardly predicts the physicochemical or other properties of any given chemical structure is first constructed and then its inverse mapping is found to determine the design of the chemical structure that exhibits the desired properties in the backward direction.The former is often referred to as quantitative structure-property relationship (QSPR) analysis [9], and the latter is called inverse structure-property relationship (inverse-QSPR) analysis [10].Usually, the inverse problem is solved using heuristic search techniques such as a genetic algorithm.A molecular generator is then used to sequentially modify a candidate molecule such that the resulting predicted properties fall into the region of the desired properties.The generative model plays an important role in this process.Traditionally, structural modifications have been performed by conducting stochastic recombination with a predefined set of molecular fragments or random atomic substitution.By utilizing fragments of existing molecules as building blocks, we can restrict the degrees of freedom in the resulting chemical structures and narrow down the search space to enhance the synthesizability of virtually created molecules.However, excessive narrowing of the search space may reduce the novelty of the structures created.In order to overcome this limitation, increasing attention has been paid, since around 2017-2018 in particular, to the development of molecule generative models that rely on advanced machine learning techniques [1,3,4,[11][12][13][14][15][16][17][18][19][20][21].With a training set of compounds synthesized thus far, a generative model for molecular graphs is constructed to mimic the rule of frequently appearing chemical fragments and bonding patterns in the training molecules.Using such a model, we can freely scan the vast chemical space to identify innovative hypothetical molecules.
As described above, machine learning-based molecular design has made great technical progress over the past few years.However, there are still hurdles to be overcome in their practical applications.One reason for this is owing to the difficulty in determining the synthetic routes to such designed molecules.Concurrently, with the growing attention to molecular design, significant progress has been made in computational methods for synthetic route design [22][23][24][25][26][27][28].Similar to the molecular design task, the general workflow for synthetic route design consists of forward and inverse problems.The goal of the forward problem is to derive a model that predicts the chemical structure of a synthetic product for a given set of reactant molecules.In contrast, in the inverse problem, inverse mapping of the forward model is explored to identify a set of reactants that produces a given desired product.Recent developments in deep learning technologies have significantly improved the accuracy of predicting reaction outcomes in organic synthesis.For example, if the structures of the reactants and products are treated as graphs, the reaction prediction task can be formulated as a graph transformation problem [23].Under the string representation of reactants and products according to the simplified molecular input line entry system (SMILES) chemical language [29], deep neural networks for sequence-to-sequence translation can be utilized to predict the SMILES string of a product from input reactants.For example, it was reported that the reaction prediction using Transformer, a well-known encoderdecoder architecture for machine translation, could successfully predict the chemical structures of synthetic products with over 90% accuracy [22].Furthermore, several methods have been developed to solve the inverse problem of such forward reaction prediction models.The objective is to identify a set of promising reactants from a list of commercially available compounds that can synthesize a desired product [30][31][32].
More recently, several attempts have been made to simultaneously design desired functional molecules and their synthetic routes under a unified methodological framework.For a given reaction prediction model, a virtual library of candidate molecules can be created by which a set of reactants is given to the model to produce its synthetic product.In addition, the properties or a certain score of a candidate molecule can be calculated using machine learning models or an arbitrary reward function.By obtaining the inverse mapping of such a cascade model, which defines the composite mapping from any given reactant set to a product and from the product to its properties, the products exhibiting desired properties and their reactant sets can be predicted simultaneously.The Molecule Chef algorithm proposed by Bradshaw et al. [33] used a cascaded forward model that connects a deep generative model of reactant molecules (Molecular Transformer) for the reaction prediction [22], and a property prediction model for the created virtual products.Gottipati et al. [32] formulated the inverse problem of such a cascade model as a combinatorial optimization problem over a set of commercially available compounds, and proposed a reinforcement learning algorithm to identify the promising combination of reactant molecules.Gao et al. [34] formulated the problem of synthesis planning as generating a synthetic tree, and proposed a Markov decision process method to construct the synthetic tree in a bottom-up manner.Horwood et al. [35] considered a reinforcement learning framework for molecular design to use known chemistry as a starting point and optimize it by sequentially performing reactions.However, there are still many unsolved technical problems in these existing methods, and their predictive performance remains yet to be improved.One of the technical difficulties comes from the complexity of the search space.The search space grows exponentially due to the degree of freedom in the network topology of designed synthetic pathways and the possible combinations of reactants.Therefore, it should be theoretically guaranteed that a given computational method can reach the entire search space.The computational burden of exploring such a large space is another important issue; more attention should be paid to developing methods that accelerate the search process.This study aims to provide solutions to these problems.
In this paper, we formulate the task of simultaneous design of molecules and synthetic routes as a general statistical problem in which a forward model is defined as a cascade of reaction prediction models and property prediction models, as in the previous work of Gottipati et al. [32].In the inverse problem, we seek a set of reactants and a network structure of synthetic reactions such that the resulting products reach a predefined region of desired properties.The overall search space for the reactant sets is spanned by all possible combinations of given commercially available compounds.If the number of commercial compounds is of the order Oð10 6 Þ and the number of reactants involved in a synthetic route is 10, the size of the search space will reach Oð10 60 Þ.The structure of the reaction network is a design variable as well, involving the number of reaction steps and the network topology.For example, a pattern of branching routes, the width and depth of networks, and the number of leaf nodes are included in the design variable.Therefore, it is necessary to solve the intractable combinatorial problem defined over an extremely vast search space.
Here, we present an efficient algorithmic technique and implementation to solve this hard problem.Specifically, we present a sequential Monte Carlo method based on a recurrent network search algorithm to simultaneously identify a reaction network, its constituent reactant sets, and the final products that satisfy the arbitrary target properties.An important practical requirement considered here is the maintenance of diversity in the designed molecules and synthetic routes.There are always errors in the properties and synthetic routes predicted by statistical models.Therefore, the optimal solution in a model is not optimal in practice.Our method aims to exhaustively identify a wide variety of promising candidates and to present various scenarios to domain experts.The final decision is left to the expert.A Python library called Seq-Stack-Reaction has been made available on GitHub [36], which allows us to plug-in any forward model into the Bayesian design workflow (Figure 1).

Forward model: synthetic reaction
Suppose that a single-step reaction is given as S 1 and S 2 denote two reactants, and P denotes the product, which is assumed to be a singleton as byproducts are ignored here.In this study, we considered only synthetic reactions with two reactants.In this study, we considered only synthetic reactions with two reactants, but the proposed method can be generalized to handle any number of reactants.Our software, which will be introduced later, can design reaction pathways without limiting the number of reactants.In addition, solvents and reagents can also be incorporated on demand.
Consider that the single-step reaction is modeled by a function r as

P ¼ rðSÞ;
(2) where S ¼ fS 1 ; S 2 g.Function r represents the change in the chemical structure from S to P. Note that r is a set function that is invariant to the exchange of S 1 and S 2 .As described later, the single-step reaction prediction model was modeled using Molecular Transformer.
Arbitrary reaction networks can be modeled by combining the single-step model.For example, consider a three-step single-chain reaction as Step 1 : Step 3 : P 2 þ S 31 !P 3 ð¼: PÞ: In the first step, two reactants, S 11 and S 12 , produce the intermediate product P 1 , followed by the second step, which produces the second intermediate product P 2 by reacting P 1 and a newly selected reactant S 21 .In the third step, the intermediate product P 2 and reactant S 31 react to produce final product P :¼ P 3 .This reaction cascade can be expressed using the single-step reaction model, as follows: The final product P ¼ P

Forward model: property prediction models and scoring function
Once the final product P is generated as P ¼ gðSjGÞ with any given S, we estimate its properties Y using the prediction model Y ¼ hðPÞ as As the product can be represented by the deterministic function gðSjGÞ of S, Y can be expressed by a function of S as Y ¼ f ðSjGÞ.Here, Y can be a vector of one or more properties.Function h can be specified arbitrary; for example, a machine learning property prediction model or a scoring function such as the quantitative estimate of drug-likeness (QED) score [37].The objective of molecular design is to identify a reactant set S with the resulting P exhibiting a set of desired properties Y � with respect to the given forward model Y ¼ f ðSjGÞ.For the molecular design task, it is necessary to define a measure of the discrepancy dðY; Y � Þ between the predicted properties and the target.A typical example of a discrepancy measure is the Euclidean distance: Additionally, various measures can be defined depending on the type of task.For example, if the target property is given as a region U � , we can use the following 0-1 loss: dðY; Hereafter, we consider dðY; U � Þ, but we can use any type of discrepancy without loss of generality.Alternatively, in the case of a score-type monotonic measure such as QED, d can be defined, for example, as Furthermore, in the task of retrosynthetic prediction, where a target product P � is given as the design purpose, we can use, for example, the 0-1 loss between P and P � as follows: This is equivalent to setting hðPÞ ¼ 1 and Y ¼ P in the property prediction model, Equation (5).

Bayesian inverse problem
Herein, we describe the task of designing molecules with their synthetic routes.Suppose that a collection of N commercially available compounds is given by A reactant set S that forms the designed G should be selected from B. Here, by P k ðBÞ, we denote the set of all k combinations from B. Then, the support PðBÞ of S is expressed as where the maximum number of reactants that can be selected is constrained to K.
The design variable is denoted by a tuple x ¼ fS; Gg.Here, we write the model in Equation (5) as YðxÞ ¼ f ðxÞ ¼ f ðSjGÞ in order to explicitly express that the predicted property Y is a deterministic function of the reactant S and the reaction network G.As in our previous studies [1,30], molecular design based on Bayesian inference is performed based on the target distribution πðxÞ, which is defined on PðBÞ: According to Bayes' law, the posterior distribution πðxÞ ¼ pðxjY 2 U � Þ is proportional to the joint distribution pðx; Y 2 U � Þ, which consists of the product of the likelihood pðY 2 U � jxÞ and prior distribution pðxÞ.
The forward model forms the joint probability distribution, which is modeled by the Gibbs distribution , with temperature parameter σ > 0. The normalizing constant Z is the canonical partition function: The prior distribution pðxÞ is used to narrow down the broad solution space based on prior knowledge.For example, the prior can be modeled as follows: The indicator function Ið�Þ takes the value one or zero depending on whether the designed reactant set S belongs to a subset of purchasable compounds.
The second and third terms on the right-hand side penalize the increasing number of reactants (jSj) and the size of the designed network (jGj), where τ and τ 0 determine the magnitude of penalties.By identifying x with a sufficiently high posterior probability, we predict product P and its synthetic reaction route fS; Gg that satisfy design objective U � .However, computational difficulty here arises from the extremely large search space.The search space is composed of all combinations of reactants that are purchasable.The number of candidate reactants is typically of the order Oð10 6 Þ, resulting in the cardinality of the solution space burgeoning to Oð10 6�k Þ when k reactants are involved in the synthetic route planning.In addition, the network topology to be explored further increases the size of the search space.

Sequential Monte Carlo in general
For the main building block of the proposed method, we employed a sequential Monte Carlo (SMC) algorithm [38] to draw a promising sample set of x ¼ fS; Gg from the target πðxÞ.As mentioned above, because the target distribution is defined on a large combinatorial space, an ordinary SMC cannot approximate it adequately.In particular, it is difficult to obtain a diverse set of highly probable molecules with their reaction routes using such ordinary methods, which will be demonstrated later.The proposed method, shown in the next subsection, was developed as an extension of SMC to overcome the difficulty of combinatorial complexity.To clarify the design concept of the proposed method, we briefly describe the general SMC method here.
In conventional SMC, we define an augmented target distribution π A ðxÞ using T auxiliary distributions, π t ðx t Þ ðt ¼ 1; . . .; TÞ, and an arbitrarily chosen initial distribution π 0 ðx 0 Þ: The augmented variable x ¼ ðx 0 ; x 1 ; . . .; x T Þ consists of T þ 1 auxiliary variables.The goal of SMC is to efficiently approximate the entire system π A ðxÞ with the Monte Carlo approximator, successively sampling x t in the order t ¼ 0; 1; . . .; T. The definition of the auxiliary distribution is arbitrary.For example, if the last auxiliary distribution is defined as the original target π T ðx T Þ ¼ πðxÞ, we can take a sample set of x T to obtain an approximate distribution.
Alternatively, all T auxiliary distributions π 1 ðx 1 Þ . . .; π T ðx T Þ can be set to be identical to the target πðxÞ.In this case, all samples of x 1 ; . . .; x T can be used for the approximate inference.The essence of SMC methodology is to be able to use any sequence of auxiliary distributions to efficiently obtain random samples from the intractable target distribution.As a constraint to be satisfied, it is imposed that the last auxiliary distribution is consistent with the target distribution.
To derive the sampling algorithm, we rewrite the augmented distribution in Equation ( 15) as The conditional probability distribution ηðx t jx tÀ 1 Þ, called the proposal distribution, determines the transition process from x tÀ 1 to x t .Assume that we currently have a sample set fx i tÀ 1 ji ¼ 1; . . .; mg of x tÀ 1 that follows π tÀ 1 ðx tÀ 1 Þ.This set defines a Monte Carlo approximation of π tÀ 1 ðx tÀ 1 Þ as The indicator function Ið�Þ takes the value one if the argument is true; otherwise, it takes zero.The purpose of each step of SMC is to derive a conversion from πtÀ 1 ðx tÀ 1 Þ to πt ðx t Þ based on the form in Equation (16).To be specific, consider the following recursive formula derived by substituting the approximate distribution πtÀ 1 ðx tÀ 1 Þ into Equation ( 16): Based on this form with the given πtÀ 1 ðx tÀ 1 Þ, we generate a sample set fx i t ji ¼ 1; . . .; mg of x t from πt ðx t Þ by performing the sampling importance resampling (SIR) method [39] with the importance weight given by wðx t jx i tÀ 1 Þ ¼ . The algorithm starts with an initial sample set fx i 0 ji ¼ 1; . . .; mg , π 0 ðx 0 Þ and repeats the following steps for t ¼ 1; . . .; T: (1) Draw a particle x i t from the proposal distribution ηðx t jx i tÀ 1 Þ for each of i ¼ 1; . . .; m. (2) Calculate the importance weight wðx i t jx i tÀ 1 Þ to obtain the approximate distribution as (3) Resample fx i t ji ¼ 1; . . .; mg with probability proportional to wðx i t Þ to renew the m samples as following an empirical distribution πt with the equal weight as in Equation (17).
In step 1, the previous sample set is tentatively replaced with a new one according to the proposal distribution ηðx t jx i tÀ 1 Þ.The proposal distribution is designed to search a neighboring area of the currently obtained x i tÀ 1 with a certain probability and to search a still unexplored region with the remaining probability.This strikes a balance between exploitation and exploration.In step 2, we calculate the importance weight representing the goodness-of-fit of the proposed x i t .Finally, in step 3, we determine the survival or death of x i t according to the importance weight.The SMC is essentially equivalent to a genetic algorithm.
As shown later, the simple SMC performs very poorly in our molecular design task.The support of the posterior distribution is extremely large; moreover, the promising solution sets are widely scattered in the huge support.Because conventional SMC cannot cope with the target task, we developed an extended method following the SMC framework.

Recurrent design algorithm for synthetic reaction networks
Here, we begin by depicting the key idea of the recurrent design technique for reaction networks following the example shown in Figure 3.Note that for any reaction network, the final product at the root node has two reactants in its parent nodes.We consider that at each step of SMC, denoted by t, only the two reactants of the final product are explored.In the forward cascade model Y ¼ h � gðSjGÞ, only the single-step reaction S 1 þ S 2 !P is considered as G, and the two reactants fS 1 ; S 2 g are selected from the original set B of commercial compounds and an additional set I tÀ 1 containing all intermediate products computationally synthesized until step t À 1.Let B t ¼ B [ I tÀ 1 be the expanded set of candidate reactants and let x t ¼ fS 1 ; S 2 g be the auxiliary variable at t.Then, the auxiliary distribution π t ðx t Þ is defined as Unlike the general form in Equation ( 5), the forward model Yðx t Þ ¼ h � rðSÞ represents only the single-step reaction model rðSÞ, and consequently the auxiliary distribution π t ðx t Þ is a function of two reactants only.Nevertheless, the model is able to represent general reaction networks by selecting the already calculated intermediate products contained in B t , to which their reaction networks are implicitly assigned.Specifically, we proceed with the following SMC procedure.In step t À 1, a sample set fx i tÀ 1 ji ¼ 1; . . .; mg of reactant set x tÀ 1 is obtained, and a new set fx i t ji ¼ 1; . . .; mg is generated using the proposal ηðx t jx tÀ 1 Þ, where each x t is sampled from B t .Here, all newly generated products P i ¼ rðx i t Þ (i ¼ 1; . . .; m) are added to the set of candidate reactants in the next step, as follows: Thus, the support B t of the auxiliary distribution constitutes a sequence of increasing sets as At each step, the reactants are sampled from the auxiliary distribution π t ðx t Þ defined on B t � B t following the procedure described above.After calculating the predicted products P i ¼ rðx i t Þ using the single-step reaction model and predicted properties , and calculating the importance weights, resampling is performed to determine the survival or death of x i t .Because the solution space for each t is restricted to the support of two reactants B t � B t , no combinatorial explosion occurs.
Additionally, as t increases, the network topology that can be represented by B t increases monotonically.In principle, if t approaches infinity under the use of a proper proposal distribution, then B t can represent any network topology.We call this method the Bayesian sequential stacking algorithm, which describes the process of constructing a reaction network by recurrently stacking single-step reactions (Figure 3).
For the proposal ηðx t jx tÀ 1 Þ, we employ a mixture model such that a neighboring reactant of each in x tÀ 1 is selected from B t with probability α, and with probability 1 À α, x t is chosen completely at random from B t � B t in order to obtain a renewed x t .Probability α is a hyperparameter that controls the trade-off between 'exploitation' and 'exploration'.The 'exploration' creates a mechanism that enhances the diversity of solutions.The problem we face here is the computational cost of the neighborhood search.The set of candidate reactants B t grows monotonically with each step.Calculating the similarity between all entries in fx i tÀ 1 ji ¼ 1; . . .; mg and the monotonically increasing B t at every step is quite time-consuming.Therefore, we introduce a method to reduce the computational complexity, as described below.
The initial set B of commercial compounds is divided into K clusters according to the pattern of the chemical structures.First, the chemical structure S is transformed into a descriptor vector ϕðSÞ of length 3239 with the concatenation of RDKit fingerprint [40] (length 2048), MACCS (Molecular ACCess System) keys [41] (length 167), and Morgan fingerprint [42] Figure 3. Design of synthetic reaction networks using the SMC calculation based on sequential stacking algorithm.In the current step, all sampled reaction networks and their products are added to the reactant pool in the next step.By searching for single-step reactions from this expanding pool of reactants, the network can be built up recursively.
of radius 2 (length 1024).Then, generative topographic mapping (GTM) is applied [43], which is a well-established integrated method of dimensionality reduction and clustering, in order to obtain a unique mapping from vectorized chemical structures to class labels as k ¼ kðSÞ;kðϕðSÞÞ where k 2 f1; . . .; Kg: (23) Thus, an arbitrary compound set can be partitioned into K disjoint clusters (Figure 4).Using a trained GTM, candidate reactants that are newly added to B tÀ 1 are sequentially grouped into each of the K predefined clusters C tÀ 1 k (k ¼ 1; . . .; K).Here, R 2 C k denotes an element of cluster k, where the cluster members vary with step t; here, we omit the subscript indicating the dependence of the cluster on t.The model η 0 ðx t jx tÀ 1 Þ for the neighborhood search in the proposal distribution selects x t ¼ fS 1;t ; S 2;t g with equal probability from cluster R 2 C kðS i;tÀ 1 Þ (i ¼ 1; 2) to which each reactant S 1;tÀ 1 or S 2;tÀ 1 in x tÀ 1 belongs.The explicit form of the probabilistic model can be expressed as For the model corresponding to 'exploration', we sample a candidate from B t with equal probability.In summary, the proposal distribution is given by the two-component mixture distribution as The first and second terms function as the 'exploitation' and 'exploration' mechanisms, respectively.Note that the number of clusters K in GTP can also be interpreted as a hyperparameter that determines the trade-off between exploitation and exploration in SMC.When the number of clusters is increased, the homogeneity of molecules within a cluster increases.Therefore, the similarity of the candidate molecules generated from the proposal distribution will also increase.This corresponds to placing a higher weight on the exploitation mechanism.Conversely, when the number of clusters is small, the probability of being replaced by a candidate molecule with a large structural difference tends to increase.

Acceleration techniques
Here, we consider the problem of slow computation of the deep neural network for the reaction prediction.In this study, we used Molecular Transformer developed by Schwaller et al. [22] for the forward prediction of a single-step reaction.In our test, a single-step reaction prediction of m ¼ 500 particles using Molecular Transformer required 25-40 seconds on average on a Linux server with an NVIDIA V100 GPU.When the number of steps was set as T ¼ 5000, the total runtime of the reaction prediction exceeded 56 hours.We therefore used a computationally light surrogate model Y ¼ dðSÞ that directly predicts property Y from a given set of reactants S without going through the reaction prediction to generate the product.The chemical structure of each reactant in S was converted Using a pretrained GTM, any query reactant is uniquely mapped to a cell or cluster to which similar reactants belong.A structurally similar reactant of the query compound can be obtained by sampling the instances in the cell with equal probabilities.
into a binary vector of length 3239 by concatenating the Morgan fingerprint with radius 2 and bit length 1024, the MACCS keys, and the RDKit fingerprint.The intersection of the binary vectors for the two reactions in S was then taken to obtain a singledescriptor vector.From the United States Patent and Trademark Office (USPTO) open chemical reaction dataset [44], which contains 1.1 M reactions, we selected a subset of 492k reactions that involve exactly two reactants.We then substituted the selected reactant pairs in Molecular Transformer to obtain their products and calculated their model properties Y using the property predictor h.The gradient boosting regressor [45] was then trained to learn the mapping from S to Y using 80% of the 492k instances as the training set.For hyperparameter tuning, using 20 candidate values for the learning rate, 15 for the max depth, and 10 for the number of estimators, we performed 10-fold cross-validation looped within the training set and selected the optimized hyperparameters attaining the smallest mean absolute error (MAE).In the calculation of the importance weights in the SMC module, the surrogate Yðx t Þ ¼ dðSÞ was used instead of Yðx t Þ ¼ h � rðSÞ.Samples drawn from the proposal distribution at each step of the SMC were handed over to the validation module with Molecular Transformer r and property predictor h for the exact calculation of their synthetic products and resulting properties.
We also considered a technique for parallel computing, wherein the entire algorithm was divided into two modules: SMC and the aforementioned validation calculation of products and properties using the forward models.The SMC module sequentially feeds the reactants sampled at each step into the module of the forward model.The forward-prediction module calculates the synthetic products and properties of the received reactants and adds the calculated products to the pool of candidate reactants in the SMC module.These two units carry out the computation of the reactant sets flowing from the different steps of SMC.The two modules perform their tasks simultaneously once the data exchange is complete, synchronizing the timing of the data exchange, as schematically described in Figure 5.The SMC module proceeds with its calculation as t ¼ 0; 1; . . .without any suspension (top row in Figure 5).In contrast, the forward-prediction module is subject to an idle state.Once the calculation of one step of the SMC is completed, the set of reactants produced there is handed over to the module for the forward calculation in an idle state.The forward-prediction module then calculates the predicted products using Molecular Transformer, which are sent back to the SMC module.Once the data transfer is completed, the forward-prediction module returns to the idle state and waits to receive the next reactant set.In the example shown in Figure 5, we assume that the SMC module is approximately 1.5 times faster than the forward-prediction module.In this case, three forward-prediction modules are allocated to different processing units in order to reduce the idling time as much as possible.

Summary: Bayesian sequential stacking algorithm for molecular design
The proposed molecular design algorithm is summarized in Algorithm 1.The overall workflow consists of two modules: SMC and calculation of the forward model.The SMC module performs Once the calculation of one step of the SMC is completed, the set of reactants produced there is handed over to the module for the forward calculation in an idle state.The forward-prediction module then calculates the predicted products using Molecular Transformer, which are sent back to the SMC module.Once the data transfer is completed, the forward-prediction module returns to a idle state and waits to receive the next reactant set.
posterior sampling of the single-step reactions using the current pool B t of the candidate reactants.The sampling calculation consists of (1) resampling based on the goodness-of-fit of the current set of reactants, (2) updating the set of hypothetical reactants based on the proposed distribution, and (3) calculating the goodness-of-fit importance weights.All reactant sets generated from the proposal distribution are handed over to the forward model module to generate the predicted products and characterize their properties.All the products generated in this module are successively added to the pool of candidate reactants B t in the SMC module.The monotonically growing reactant pool B t contains the intermediate products calculated from the reaction prediction model.Therefore, throughout the search for single-step reactions in SMC, reaction networks with various structures can be constructed by sampling and stacking intermediate products for which their reaction networks have already been calculated.The computational complexity of the neighborhood search owing to the monotonic growth of B t is suppressed by pre-clustering the set of reactants using the GTM.
In summary, the algorithm is based on four ideas: (1) the recurrent algorithm for the network search, which is based on the sequential expansion of the reactant pool, (2) the avoidance of the neighborhood search from large reactant pools using GTM clustering, (3) efficient computation of the forward cascade model using a surrogate model, and (4) the asynchronous parallel computation algorithm.

Software and potential applications
The analyses shown later can be performed using the Python package Seq-Stack-Reaction [36], which we have made available online under the license BSD 3-Clause [46].Seq-Stack-Reaction can be installed with Conda [47].Users can plug-in any model for property and synthetic reaction predictions.The list of commercial compounds and the maximum number of reaction steps can also be specified arbitrarily.In the latest version, reactions with a variable number of reactants, such as one-and two-component systems, can be incorporated into the design calculations.In a parallel computing environment, the asynchronous parallel search algorithm can be executed by specifying an option.Additionally, a visualization of the predicted synthetic pathway network of the excavated products is implemented.
In the following, we show an example of applying Seq-Stack-Reaction to the design of drug-like molecules.However, the present method can be applied to general molecular design tasks in materials research.To demonstrate the versatility of the proposed method, as an additional example, the GitHub website provides a sample dataset and Python scripts to design lubricant molecules by following the task designed in [48].As the target property, we choose the viscosity index (VI) that indicates the temperature sensitivity of viscosity.To maintain stable machine operations, oil with a high VI value is required by machinery equipment.Because the VI is known to underestimate the viscosity susceptivity of low-viscosity oils, a complementary index called the dynamic viscosity index (DVI) was also added to the design goal.Using a structure-property dataset obtained from all-atom classical molecular dynamics simulations in [48], we predicted the VI and DVI from the chemical structure of any given lubricant molecule.Its inverse map was then computed to identify highly viscous molecules with their synthetic pathways that could be designed from a given set of reactants.

Results and discussion
The predictive and computational performances of the proposed method were evaluated using an application example that involved designing drug-like molecules.In particular, we constructed several variants of the Bayesian molecular design algorithms by combining the four constituent mechanisms described in the previous section, and compared their performance to quantitatively investigate their individual contributions to the overall scheme.

Target properties
As target properties Y ¼ hðPÞ, we considered the following two physicochemical properties that quantitatively express the drug-likeliness of a designed molecule P.
• QED The quantitative estimate of drug-likeness (QED) quantifies drug-likeness as a score ranging between 0 and 1 [37].QED was modeled on a dataset of 771 known oral drugs using eight descriptors, including molecular weight, polar surface area, and number of hydrogen bond donors and acceptors.The higher the QED value, the more drug-like the molecule is judged to be.The target range of the QED was set to be greater than 0.8.• logP The octanol-water partition coefficient logP is defined as the normal logarithm of the ratio of the concentrations of molecules in the organic and aqueous layers at equilibrium.According to Lipinski's rule of five [49], the target range of logP was defined as not exceeding 5.
Computational models of these two properties are implemented in the modules of RDKit [40].

Reaction prediction
As a forward reaction prediction model P ¼ gðSÞ, we employed Molecular Transformer [22], which is known to be a standard model.This attention-based neural translation model defines a translation between the SMILES strings of reactants and their products.For simplicity, reagents were removed from the model input.SMILES strings for multiple reactants were input into the model as separated by periods ".".The SMILES strings were all canonicalized using RDKit.
The inputs were tokenized with the regular expression according to the original paper of Molecular Transformer.
The model was trained from scratch using 80% of the training instances randomly selected from the 500k recorded reactants and products in the USPTO dataset [50].The top 1 prediction accuracy of the trained model for the remaining data reached 78.2%, which is comparable to the accuracy reported in previous studies [30,51].

Surrogate models
We applied gradient boosting regression to construct surrogate models that predict the two target properties from a set of two reactants without going through the prediction of synthetic products using Molecular Transformer.We randomly selected two pairs of 492K reactants from USPTO, generated their products using Molecular Transformer, and calculated their QED and logP.The resulting set of 394K samples was used to train the surrogate models for QED and logP.The R 2 values for predicting QED and logP for the 98K additional test samples were 0.78 and 0.86, respectively.

Commercial compounds
We used a subset of the Enamine building block catalog global stock as the set of commercially available reactants by which virtual molecules are synthesized [52].This set, consisting of 150K unique building blocks, was narrowed down by Gottipati et al. [32] to those applicable to one or more rules in a templatebased reaction prediction model.The design task is to identify the synthesizable products in this reactant set and to meet the required properties.The Enamine reactant set was also used to train the GTM model as well.Each reactant was transformed into a binary vector of length 3239 concatenated with the Morgan fingerprint of radius 2 and bit length 1024, the MACCS key, and the RDKit fingerprint.A total of 441 predefined clusters were used in the similaritybased SMC proposal.

Sequential Monte Carlo
As explained previously, the proposed method was designed based on four ideas.To investigate the contribution of each idea to the overall performance, we implemented the five variants listed in Table 1: (1) SMC-RECUR-GTM-SR-PL, (2) SMC-RECUR-GTM, (3) SMC, (4) Random, and (5) Random-RECUR.The meanings of the abbreviations attached to each method are as follows: • RECUR Recurrent design technique for the reaction networks • GTM The use of the 441 GTM clusters in the similarity-based proposal • SR Surrogate models for the direct prediction of QED and logP • PL Asynchronous parallel computing of the SMC and reaction prediction modules For the vanilla SMC in (3), only single-chain synthetic reactions with the number of steps fixed at n 2 f1; 2; 3g were considered.For example, in the case of n ¼ 2, we considered only the cascade-type reactions as S 11 þ S 12 !P 1 ; P 1 þ S 21 !P 2 and searched for three reactants S ¼ fS 11 ; S 12 ; S 21 g that synthesize product P ¼ P 2 satisfying the property requirements.For 'Random' in (4), a completely random search for two reactants was conducted for the single-step reaction S 11 þ S 12 !P 1 .The variant 'Random-RECUR' in ( 5) represents an integrated method of random search and recursive network design algorithm.The experimental conditions were set to be common to the five variants: for SMC and Random search, the number of iterations was set to T ¼ 10000, the number of particles was set to m ¼ 500, for SMC, the exploration-exploitation trade-off parameter of the proposal distribution was set to α ¼ 0:8, indicating a 20% chance of making an exploratory search.

Computational environments
The experiments were carried out on an NVIDIA DGX STATION with 4 Tesla V100 and Intel Xeon E5-2698 v4 CPUs.

Results
After the experiment, we compared the five methods listed in Table 1.As mentioned above, for the vanilla SMC, the design of the single-chain reaction routes was tested under three different step sizes n 2 f1; 2; 3g.Therefore, seven methods were included in this comparison.Each method was tested in two scenarios.In scenario 1, a relatively small number of commercial compounds was assumed to be available: the candidate reactants were generated by randomly sampling 10k of the 150k Enamine compounds available for purchase.In scenario 2, all 150k compounds were used to design the molecules.
First, we report the results of scenario 1.As a performance measure, we simply considered the number of unique designed molecules that reached the target property range.Figure 6 shows the evolution of the number of unique hit molecules at each step of the sequential calculation.Unsurprisingly, the two random search algorithms were clearly less efficient than the others.For SMC-1 and Random, whose design space is restricted to single-step reactions with two reactants, the number of hits decayed as the number of search steps increased.This means that the limited design space consisting of two combinations of at-most 10k commercial compounds can be adequately covered by the naive algorithms.In contrast, for the vanilla SMC with two-and threestep reactions (SMC-2, SMC-3) and the recurrent algorithms to explore arbitrary reaction networks (SMC-RECUR-GTM, SMC-RECUR-GTM-SR-PL), the number of hit molecules at each step remained constant, and no decay trend was observed for T ¼ 10000.This observation confirms that the search space for general reaction design is quite large.In the comparison between SMC-RECUR-GTM and its extended version, SMC-RECUR-GTM-SR-PL with the surrogate models shows that the number of hits of the latter is slightly lower than that of the former.This is a natural consequence because the surrogate models involve approximation errors.
Figure 7 shows the evolution of the cumulative number of unique hit molecules that reached the target region against the number of steps.The left panel shows the cumulative number of hits as a function of the total number of molecules generated.This is a different view of the results in Figure 6.It was confirmed that SMC-RECUR-GTM is the most efficient method to find molecules hidden in the target region.It was also confirmed that SMC-RECUR-GTM -SR-PL has a reduced hit rate due to the use of the surrogate models involving approximation errors, as mentioned above.Vanilla SMC-2 and SMC-3 also consistently found the target molecules.This suggests  1.
that it is difficult to fully cover the large design space with as few as T � m ¼ 500000 search trials.The right panel in Figure 7 shows the cumulative number of hits per step as a function of execution time (right panel).
In terms of computational cost, the parallel recursive molecular design algorithm SMC-RECUR-GTM-SR-PL with the surrogate models showed by far the highest search efficiency.With regard to the results for scenario 2, the number of hit molecules at each step (Figure 8) and the cumulative number of hits (Figure 9) were significantly lower for the two random searches, as in scenario 1.In contrast to the results of scenario 1, SMC-1 and Random, whose design space was restricted to single-step reactions with two reactants, did not show any decay in the number of hit molecules throughout all steps.This observation indicates that, as the number of commercial reactants increases, the design space becomes much larger, even for singlestep reactions.Among the methods other than the random searches, no significant difference was found in the evolution of the number of hit molecules and the cumulative number.However, similar to the results of scenario 1, SMC-RECUR-GTM-SR-PL showed by far the highest search efficiency in terms of the number of hit molecules per execution time (right panel in Figure 9).From a practical point of view, search performance against execution time is considered the most important criterion.Therefore, we conclude that SMC-RECUR-GTM-SR-PL is superior in terms of the evaluation criteria for detecting the number of unique molecules.
We then performed a quality assessment based on the structural diversity and novelty of the generated hit molecules and their coverage against existing molecules synthesized so far.Specifically, we assessed the coverage and novelty of the 300K hypothetical molecules that reached the target region in scenario 2 against a hit compound set extracted from the organic compound database ChEMBL [53] as follows: (1) The set of hit compounds, denoted by A, was obtained by calculating the two properties of 127k compounds registered in the ChEMBL.
(2) Let B be the set of 300K hit virtual molecules generated in scenario 2. (3) Evaluate the similarity between the compounds in A and B using the Tanimoto coefficient of  Ideally, a set of reasonably novel and diverse molecules should be generated, while maintaining a high coverage to existing molecules.This corresponds to a situation in which the distribution of B encompasses A. In such a case, the CN curve deviates slightly from the 45 � line and shows an upward convex pattern.For this criterion, SMC-RECUR-GTM, SMC-RECUR-GTM-SR-PL, and Random-RECUR showed better properties than the others (Figure 10).When the threshold of Tanimoto similarity was set to γ � 0:7, the coverage and novelty of SMC-RECUR-GTM and SMC-RECUR-GTM-SR-PL were both approximately 0.3 and 0.8, respectively.Note that the SMC-3 showed a significantly lower novelty than the others.As the search space grows, the ordinary SMC is more likely to get trapped in local solutions, making it more difficult to escape from them in a finite run time.The increase in the number of steps in the singlechain reaction from two to three caused the SMC to significantly degrade in performance.
Table 2 provides a summary of the performance comparison between the proposed method (SMC-RECUR-GTM and SMC-RECUR-GTM-SR-PL) and eight commonly used molecular generators: SMILES-LSTM [4], Character VAE (CVAE) [3], Grammar VAE (GVAE) [54], GraphVAE [55], Junction Tree Autoencoder (JT-VAE) [14], Constrained Graph Autoencoder (CGVAE) [58], Molecule Chef [33], and DoG-Gen [31].Here, we have reported six performance metrics (validity, uniqueness, quality, novelty, FCD, and synthetic accessibility score) as a common benchmark proposed by Brown et al. [59] that were reported in the previous studies: • Validity Proportion of generated molecules that can be parsed by RDKit • Uniqueness Proportion of unique molecules that can be parsed by RDKit • Novelty Proportions of molecules that are different from the training instances • Quality Compound quality measurements, a rule set is employed to decide whether compounds can be included in high throughput screening • FCD Fréchet ChemNet distance between generated molecules and a predefined set of existing molecules • SA score The synthetic accessibility (SA) score [60] Note that these reported values were calculated on different tasks in the different studies.Although the result cannot be used to discuss superiority or inferiority based on performance values, it is confirmed that the performance values achieved by the proposed method are at almost the same level as those of the related methods.
To evaluate the synthesizability of the designed molecules, we used the SA score, which takes values from 1 to 10, with higher values indicating more difficult to synthesize.To distinguish between compounds that are easy and difficult to synthesize, a threshold value of 6.0 has empirically been used.Using RDKit, we calculated the SA score for 20K molecules randomly selected from the hit compounds of SMC-RECUR-GTM-SR-PL. Their histogram is shown and compared with one of the existing molecules in ChEMBL in Figure 11.Most SA scores are concentrated at 3.2, indicating that their distribution is not significantly different from that of previously synthesized molecules.
Here, we remark on the chemical validity of the predicted reaction networks.To see this, three examples of predicted reaction networks are shown in Figures 12-14, which were drawn with the Seq-Stack-Reaction library.Each reaction in the three reaction pathways was assessed by our expert chemist, in which unknown reagents and reaction conditions were inferred based on expert knowledge.The validity was classified into one of f1; 2; 3g as follows: 1, feasible with suitable catalysts and/or reagents; 2, difficult to react due to low reactivity or other factors; 3, infeasible.Detailed explanations have been provided in the figure captions.The rationale for each scoring is given in the figure captions.Overall, half of the predicted reactions are considered possible; the remaining reactions are considered less reactive or infeasible.
Further, the SA score was used to evaluate the synthesizability of an intermediate and final product in each reaction step.According to this evaluation criterion, most of the products on the designed network were judged to be synthesizable.However, it should be noted that the SA score is a measure of the synthesizability of a product molecule and does not evaluate the feasibility of each reaction.In fact, although the three reaction networks contained   apparently infeasible reaction steps, the synthesizability of their intermediate products was determined to be high according to the SA scores.The accuracy of the forward reaction predictor will have to be greatly improved in order to stably present a chemically valid synthetic reaction network.Therefore, while the current methodology is useful as a decision-support tool to encourage expert ideas, it cannot be used for predictive models such as those implemented in fully automated chemical synthesis.
For further validation, it is necessary to develop methods and platforms to systematically evaluate the properties of designed molecules and the feasibility of designed synthetic pathways.The computation of molecular property characterization and synthetic reactions using quantum chemical methods is extremely time-consuming; therefore, the implementation of a high-throughput evaluation system is quite challenging.The establishment of a systematic validation method is an important issue for future research.

Conclusions
In this study, we developed a machine learning methodology and a general-purpose Python library for simultaneously designing molecules exhibiting any desired set of properties and their synthetic reactions with any network topology.A forward model was constructed using a synthetic reaction prediction model and property prediction models as building blocks, and its inverse mapping was obtained based on a Bayesian inference framework to simultaneously identify a reaction network, its constituent reactant sets, and the final products that satisfy arbitrary target properties.The reactant set was selected based on a combination of predefined commercial compounds.As the design space, consisting of arbitrary reaction networks and reactant sets, was very large, we developed a sequential Monte Carlo algorithm incorporating a recurrent algorithm for the network search.Performance tests showed that our algorithm can successfully find high-quality virtual molecules.The quality of the designed molecules was evaluated based on their reproducibility and novelty  4).( 1) the reaction sites are saturated C-C bonds, which generally exhibits low reactivity.(2) This reaction is feasible by using Fe catalysts that can dechlorinate chlorobenzene.In this reaction, the upper reactant is not necessary.(3) This reaction is feasible by using reagents that can release NH 2 anion from the upper reactant.The NH 2 anion can substitute for the OH group in the lower reactant by S N 2 reactions because the NH 2 anion has higher nucleophilicity than OH anion, and chiral inversion occurs in the product.(4) the reaction is difficult to proceed.This product can be synthesized from iodomethane and the lower reactant.
with respect to previously synthesized molecules.The distributed Python library was designed using an interface that allows users to plug-in arbitrary reaction prediction models, property prediction models, and a set of commercial compounds.This is expected to facilitate its widespread use in practical applications.
Although machine learning-based research for molecular design and synthetic pathway design has been actively pursued in recent years, most such studies have worked independently on the two subjects so far.Thus, research on the simultaneous design of functional molecules and synthetic pathways has not progressed significantly.In particular, few generalpurpose libraries are currently available.The aim of this study was to develop a generic methodology and tools to link these two subjects.We believe that this milestone has been achieved.

Figure 1 .
Figure 1.Workflow of the concurrent design of molecules and their synthetic reaction networks.The forward-prediction model defines the composite mapping h � g from a given reactant set S to its synthetic product P conditioning on a reaction network G via a reaction function g, and from product P to its properties Y via a prediction model h.The desirable products and their synthetic reactions are concurrently designed by inversely exploring G and S such that the resulting properties meet the design objective Y 2 U � .

Figure 2 .
Figure 2. Examples of synthetic reaction networks.

Figure 4 .
Figure 4. Efficient sampling scheme to generate structurally similar reactants based on pre-partitioning of the reactant space.Using a pretrained GTM, any query reactant is uniquely mapped to a cell or cluster to which similar reactants belong.A structurally similar reactant of the query compound can be obtained by sampling the instances in the cell with equal probabilities.

Figure 5 .
Figure 5. SMC algorithm using asynchronous parallel computation.The SMC module proceeds without any suspension (top row).Once the calculation of one step of the SMC is completed, the set of reactants produced there is handed over to the module for the forward calculation in an idle state.The forward-prediction module then calculates the predicted products using Molecular Transformer, which are sent back to the SMC module.Once the data transfer is completed, the forward-prediction module returns to a idle state and waits to receive the next reactant set.

Algorithm 1 .
De novo design of functional molecules and their synthetic routes.

Figure 7 .
Figure 7. Cumulative number of hit molecules in scenario 1 where the number of commercially available reactants is limited to 10K.The left and right plots show the cumulative number of hits as a function of the total number of molecules generated and the execution time (CPU time), respectively.

Figure 8 .
Figure 8. Number of design molecules reaching the target property range at each step for scenario 2 where all 150K commercial compounds were used in the design.

Figure 9 .
Figure 9. Cumulative number of hit molecules in scenario 2 where all 150K commercial compounds were used in the design.The left and right plots show the cumulative number of hits as a function of the total number of molecules generated and the execution time (CPU time), respectively.

Figure 10 .
Figure10.1-coverage (horizontal axis) and novelty (vertical axis) of the set of designed virtual molecules with respect to existing molecules in ChMBLE, which is drawn as a function of varying thresholds of Tanimoto similarity.The circle represents the coverage and novelty when the similarity threshold is set to 0.7.

Figure 12 .
Figure 12.A synthetic reaction network consists of 5 single-step reactions (1)-(5).The SA scores are given to the intermediate and final products.The validity of each reaction was classified into one of f1; 2; 3g as follows: 1, feasible; 2, difficult to react due to low reactivity or other factors; 3, infeasible.(1) This reaction is feasible with a suitable reagent that hydrolyzes the upper reactant to alcohol, the reaction between the alcohol and sulfonyl chloride (R-SO 2 -Cl) is often used in the synthesis of sulfonate esters (R-SO 2 -OR').(2) This is infeasible because the methyl group cannot be found in the reactants.(3) the lower reactant is an α, βunsaturated carbonyl, which is susceptible to attack by nucleophiles such as an amino group at the β-carbon.(4) This would be difficult, considering the general reaction of sulfonate esters, CH 2 C(=O)CH 3 is released from the sulfonate ester.The CH 2 C(=O)CH 3 and CH 3 SO 2 in the lower reactant may undergo a substitution reaction with some catalysts, but in this case, the number of carbons in the product is incorrect.(5) the reaction sites are saturated C-C bonds, which generally exhibits low reactivity.

Figure 11 .
Figure 11.Histograms of SA scores of hit compounds obtained from the proposed method and existing molecules in ChEMBL.

Figure 13 .
Figure 13.A synthetic reaction network consists of 4 single-step reactions (1)-(4).(1) the reaction sites are saturated C-C bonds, which generally exhibits low reactivity.(2)This reaction is feasible by using Fe catalysts that can dechlorinate chlorobenzene.In this reaction, the upper reactant is not necessary.(3) This reaction is feasible by using reagents that can release NH 2 anion from the upper reactant.The NH 2 anion can substitute for the OH group in the lower reactant by S N 2 reactions because the NH 2 anion has higher nucleophilicity than OH anion, and chiral inversion occurs in the product.(4) the reaction is difficult to proceed.This product can be synthesized from iodomethane and the lower reactant.
The graph forms a rooted tree in which the leaf nodes consist of the four reactants in S and the root node is given by the final product.Every node, except the leaf nodes, has two children.Without loss of generality, any synthetic reaction network, beyond single-chain reactions, can be described as P ¼ gðSjGÞ and retains the graph properties.Several examples are shown in Figure2.
3 is described as a function gð�jGÞ of the four purchasable reactants S ¼ fS 11 ; S 12 ; S 21 ; S 31 g, where all the intermediate reaction states are discarded.The structure of the reaction network is represented by the synthetic graph or network G.

Table 1 .
Five different algorithms to be evaluated for the performance test.The abbreviations are explained in the main text.Number of design molecules reaching the target property range at each step for scenario 1 where the number of commercially available reactants is limited to 10K.The correspondence between the seven methods and their labels is shown in Table

Table 2 .
Validity, uniqueness, quality, FCD, and the mean of the average SA score of the hit compounds generated from the proposed method and the performance values of other methods reported in previous studies.Note that the reported values were obtained from different tasks and experimental conditions.