MARS: a motif-based autoregressive model for retrosynthesis prediction

Abstract

Motivation: Retrosynthesis is a critical task in drug discovery, aimed at finding a viable pathway for synthesizing a given target molecule. Many existing approaches frame this task as a graph-generation problem. Specifically, these methods first identify the reaction center and break the target molecule accordingly to generate the synthons. Reactants are then generated either by adding atoms sequentially to the synthon graphs or by directly attaching appropriate leaving groups. Both strategies have limitations: adding atoms results in a long prediction sequence that increases the complexity of generation, while adding leaving groups considers only those seen in the training set, which leads to poor generalization.

Results: In this paper, we propose a novel end-to-end graph generation model for retrosynthesis prediction, which sequentially identifies the reaction center, generates the synthons, and adds motifs to the synthons to generate reactants. Because chemically meaningful motifs fall between atoms and leaving groups in size, our model achieves lower prediction complexity than adding atoms and better generalization than adding leaving groups. We evaluate our proposed model on a benchmark dataset and show that it significantly outperforms previous state-of-the-art models. Furthermore, we conduct ablation studies to investigate the contribution of each component of the model to overall performance. Experimental results demonstrate the effectiveness of our model in predicting retrosynthesis pathways and suggest its potential as a valuable tool in drug discovery.

Availability and implementation: All code and data are available at https://github.com/szu-ljh2020/MARS.


A.1. Graph Transformer Network
Graph Transformer Network (GTN) (Shi et al., 2020c) utilizes improved multi-head self-attention modules from the vanilla Transformer (Vaswani et al., 2017) to aggregate both atom and bond features. In head $c$ of layer $l$, the query $q_{c,u}^{(l)}$, key $k_{c,u}^{(l)}$, and value $v_{c,u}^{(l)}$ vectors are obtained from the atom representation vector $h_u^{(l)}$ through the query, key, and value matrices as follows:

$$q_{c,u}^{(l)} = W_{c,q} h_u^{(l)} + b_{c,q}, \quad k_{c,u}^{(l)} = W_{c,k} h_u^{(l)} + b_{c,k}, \quad v_{c,u}^{(l)} = W_{c,v} h_u^{(l)} + b_{c,v}. \tag{1}$$

The GTN integrates bond features $x_{uv}$ by computing an attention score for the bond $(u, v)$ and updating the representation of atom $u$ with the following message-passing scheme:

$$h_{c,uv} = W_{c,e} x_{uv} + b_{c,e},$$
$$\alpha_{c,uv}^{(l)} = \frac{\langle q_{c,u}^{(l)},\, k_{c,v}^{(l)} + h_{c,uv} \rangle}{\sum_{w \in \mathcal{N}(u)} \langle q_{c,u}^{(l)},\, k_{c,w}^{(l)} + h_{c,uw} \rangle},$$
$$\hat{h}_u^{(l+1)} = \Big\Vert_{c=1}^{C} \Big[ \sum_{v \in \mathcal{N}(u)} \alpha_{c,uv}^{(l)} \big( v_{c,v}^{(l)} + h_{c,uv} \big) \Big],$$

where $\langle q, k \rangle = \exp\!\big(\frac{q^{T} k}{\sqrt{d}}\big)$, $d$ is the dimension of each head, $\mathcal{N}(u)$ denotes the neighbors of atom $u$, and $\Vert$ is the concatenation operation. Additionally, a gated residual connection is introduced to avoid over-smoothing.
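The edge-aware attention described above can be sketched in PyTorch as a single head. This is a minimal illustration, not the MARS implementation; the class and argument names (`GTNHead`, `edge_index`) are ours, and multi-head concatenation and the gated residual are omitted.

```python
import math
import torch
import torch.nn as nn

class GTNHead(nn.Module):
    """One attention head of a GTN layer (illustrative sketch).

    Queries, keys, and values are linear maps of atom features; the
    projected bond feature h_uv is added to keys and values before the
    softmax-normalized aggregation over each atom's neighbors.
    """

    def __init__(self, in_dim: int, d: int):
        super().__init__()
        self.q = nn.Linear(in_dim, d)
        self.k = nn.Linear(in_dim, d)
        self.v = nn.Linear(in_dim, d)
        self.e = nn.Linear(in_dim, d)  # bond-feature projection W_e
        self.d = d

    def forward(self, h, x_edge, edge_index):
        # h: [N, in_dim] atom features; x_edge: [E, in_dim] bond features
        # edge_index: [2, E], row 0 = source v, row 1 = target u
        src, dst = edge_index
        q = self.q(h)[dst]                   # q_u for each edge (v -> u)
        k = self.k(h)[src] + self.e(x_edge)  # k_v + h_uv
        v = self.v(h)[src] + self.e(x_edge)  # v_v + h_uv
        score = (q * k).sum(-1) / math.sqrt(self.d)
        w = (score - score.max()).exp()      # shift for numerical stability
        denom = torch.zeros(h.size(0)).index_add_(0, dst, w)
        alpha = w / denom[dst]               # softmax over incoming edges of u
        out = torch.zeros(h.size(0), self.d)
        return out.index_add_(0, dst, alpha.unsqueeze(-1) * v)
```

The global shift by `score.max()` leaves the per-atom softmax unchanged, since numerator and denominator are scaled by the same factor.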

A.2. Global Attention Pooling Function
Following Li et al. (2019), we use the global attention pooling function. Given the node representations $h_v$ from the output layer of the GNN, the graph representation is computed as

$$h_{\mathcal{G}} = \sum_{v} \mathrm{softmax}\big(f_{gate}(h_v)\big) \odot f_{feat}(h_v),$$

where $f_{gate}(\cdot)$ and $f_{feat}(\cdot)$ are multilayer perceptrons.
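A compact sketch of this readout for a single graph, with illustrative MLP sizes (the actual gate and feature networks in MARS may differ):

```python
import torch
import torch.nn as nn

def global_attention_pool(h, f_gate, f_feat):
    """Global attention readout: softmax over per-node gate scores,
    then a weighted sum of transformed node features.

    h: [N, D] node representations of one graph.
    """
    gate = torch.softmax(f_gate(h), dim=0)  # [N, 1] attention weights
    return (gate * f_feat(h)).sum(dim=0)    # graph-level representation

# Hypothetical gate/feature MLPs for a toy dimension of 16.
f_gate = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))
f_feat = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
```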

A.3. Objective Function
MARS is trained to predict target transformation paths using a cross-entropy loss $\mathcal{L}_c$ for predicting edit types, motifs, and interface-atom indexes, and a binary cross-entropy loss $\mathcal{L}_b$ for predicting reaction centers. The overall loss function for MARS combines these losses across all steps of the Edit and AddingMotif sequences:

$$\mathcal{L} = \sum_{t=1}^{N_1} \big( \mathcal{L}_c^{(t)} + \mathcal{L}_b^{(t)} \big) + \sum_{t=1}^{N_2} \mathcal{L}_c^{(t)},$$

where $N_1$ and $N_2$ denote the lengths of the Edit and AddingMotif sequences, respectively.
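A per-step version of this combined objective can be sketched as follows. The tensor names and shapes are illustrative, not taken from the MARS codebase:

```python
import torch
import torch.nn.functional as F

def mars_step_loss(edit_type_logits, edit_type_targets,
                   center_logits, center_targets,
                   motif_logits, motif_targets):
    """Sketch of the combined loss for one Edit step and one
    AddingMotif step: cross-entropy over discrete type/motif choices
    plus binary cross-entropy over per-bond reaction-center scores.
    """
    l_c_edit = F.cross_entropy(edit_type_logits, edit_type_targets)
    l_b = F.binary_cross_entropy_with_logits(center_logits, center_targets)
    l_c_motif = F.cross_entropy(motif_logits, motif_targets)
    return l_c_edit + l_b + l_c_motif
```

In training, these terms are accumulated over all steps of both sequences before backpropagation.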

A.4. Framework of MARS
In Algorithm 1, we present the overall workflow of the proposed method MARS.

A.5. Implementation Details
We implement MARS using PyTorch (Paszke et al., 2019) and the PyTorch Geometric (Fey and Lenssen, 2019) library. The Graph Transformer consists of six eight-head self-attention modules and employs attention pooling (Li et al., 2015) as the readout function. The GRU network comprises three layers. The embedding size D of our model is uniformly set to 512. Throughout all experiments, we trained for 100 epochs on the USPTO-50K dataset with a batch size of 32, using the Adam optimizer with an initial learning rate of 0.0003.
To optimize learning, we employed a cosine annealing learning-rate schedule with a restart cycle of 20 epochs. Training on the USPTO-50K dataset was executed on a single NVIDIA Tesla V100 GPU and took approximately 17 hours. During the inference phase, we used a beam size of k = 10 to rank predictions.
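The optimizer and schedule described above correspond to a standard PyTorch configuration, sketched below with a stand-in model (the real model and any scheduler extras used by MARS are not shown):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# Stand-in module; mirrors the stated settings: Adam with initial
# lr 3e-4 and cosine annealing that restarts every 20 epochs.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=20)

for epoch in range(100):
    # ... one epoch of training on USPTO-50K (omitted) ...
    scheduler.step()  # advance the cosine schedule once per epoch
```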

B.2. Atom and Bond Features
We adopt the atom and bond feature design of Yan et al. (2020), with all feature computations carried out using the RDKit library. The atom features employed in MARS are summarized in Table 2. These features are primarily represented using one-hot encoding, except for the atomic mass, which is a real number scaled to a consistent order of magnitude. Table 3 outlines the bond features employed in MARS; these are also represented using one-hot encoding.
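The encoding scheme can be illustrated with a small helper. The property lists below are toy values standing in for the full lists of Table 2 (which are computed with RDKit in practice), and the trailing unknown-value slot is one common convention, not necessarily the exact one used by MARS:

```python
def one_hot(value, choices):
    """Encode value as a one-hot vector over choices, with a final
    slot reserved for values outside the list."""
    vec = [0] * (len(choices) + 1)
    idx = choices.index(value) if value in choices else len(choices)
    vec[idx] = 1
    return vec

def atom_features(symbol, degree, formal_charge, mass_amu):
    """Illustrative atom featurization: one-hot blocks concatenated
    with the atomic mass scaled to order of magnitude ~1."""
    return (one_hot(symbol, ["C", "N", "O", "S", "F", "Cl", "Br", "I"])
            + one_hot(degree, [0, 1, 2, 3, 4])
            + one_hot(formal_charge, [-1, 0, 1])
            + [mass_amu * 0.01])
```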

B.3. Notation
We provide a comprehensive list of symbols used throughout the paper to facilitate a better understanding of the presented concepts and equations, as shown in Table 4. An excerpt:

- the output vector of the GRU at step t
- ψ_t: the vector derived from the molecular graph embedding and the output of the GRU at step t
- σ_*(·): linear layer of neural networks with any nonlinear activation
- f_*(·): linear layer of neural networks with any nonlinear activation, mapping entities to vectors
- h_*: embedding vector

B.4. Baselines
We benchmark the performance of our proposed method against a comprehensive set of competitors, encompassing both template-based and template-free approaches. These competitors represent state-of-the-art solutions in retrosynthesis prediction, and their diversity offers a robust basis for evaluating the efficacy of our method.
For template-based models, we consider three prominent contenders:
• RetroSim (Coley et al., 2017) selects reaction centers based on Morgan fingerprint similarity between target molecules and known precedents.
• NeuralSym (Segler and Waller, 2017) combines a fully-connected layer and a deep highway network to learn potential correlations between molecular functional groups and reactions.
• GLN (Dai et al., 2019) models the joint probability of single-step retrosynthesis to select templates and generate reactants.
Template-free models can be divided into sequence-based and graph-based models; we consider five of each. The sequence-based models are:
• SCROP (Zheng et al., 2019) adds an extra Transformer to correct predicted SMILES strings.
• RetroPrime (Wang et al., 2021) uses two Transformers to model reaction center identification and synthon completion, respectively.
• Retroformer (Wan et al., 2022) jointly learns the sequential and graphical information of molecules using a Transformer-based local-global encoder-decoder model.
• DualTF (Sun et al., 2021) unifies sequence-based and graph-based models using energy functions and uses an extra order model to aid inference.
• Chemformer (Irwin et al., 2022) is a BART-based chemical language model with 230 million learnable weights, pre-trained on 100 million molecules.
The graph-based models are:
• G2Gs (Shi et al., 2020a) employs a graph neural network to select reaction centers and generates reactants using a variational autoencoder.
• RetroXpert (Yan et al., 2020) leverages a graph neural network to predict disconnections and treats reactant generation as a sequence translation task.
• GraphRetro (Somnath et al., 2021) determines the synthons through an edit prediction model and then uses a single fully-connected network to complete the synthons with predefined leaving groups.
• MEGAN (Sacha et al., 2021) defines five graph-editing actions and uses two stacked graph attention networks to perform retrosynthesis predictions.
• G2GT (Lin et al., 2023) utilizes a Graphormer to encode the product molecular graph as a representation vector, and then uses a decoder to autoregressively convert that vector into a graph sequence representing the structure of the reactants.

C. Parameter Analysis
In this section, we delve into the key factors that affect the performance of MARS. We systematically explore and analyze the effects of the various components that contribute to the effectiveness of the model. Specifically, we investigate the impact of different graph encoders, representation readout functions, embedding dimension sizes, and the number of GTN layers, shedding light on the choices that optimize MARS's performance.

C.1. Effects of Graph Encoder
Graph encoders, which learn node representations by aggregating information from neighbors, are used to capture the topological information of molecular graphs. We investigate the effect of different graph encoders on the performance of MARS, including GCN (Kipf and Welling, 2016), GAT (Veličković et al., 2017), GraphSAGE (Hamilton et al., 2017), and GTN (Shi et al., 2020c).
As shown in Fig. 1(a), GTN outperforms the other graph encoders. This is because the self-attention module in GTN better fuses bond features into the atom representations, leading to more effective learning of the molecular graph.

C.2. Effects of Representation Readout Function
The representation readout function aggregates all atom representations to obtain the molecular graph representation. We compare the effects of several readout functions on the performance of MARS, including max, sum, mean, and attention pooling. Fig. 1(b) shows the Top-k accuracy of different readout functions. We find that attention pooling achieves the best performance for learning the representation of molecular graphs, because it can better extract information from atom representations that benefits downstream tasks.
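The four readouts compared here differ only in how per-atom vectors are collapsed into one graph vector, as this minimal sketch shows (toy sizes; the attention gate is an untrained linear layer standing in for the learned gate network):

```python
import torch

torch.manual_seed(0)
h = torch.randn(5, 8)  # atom representations of one molecule

# Fixed-pooling readouts: no learned parameters.
readout_max = h.max(dim=0).values
readout_sum = h.sum(dim=0)
readout_mean = h.mean(dim=0)

# Attention pooling: a gate scores each atom, then a weighted sum.
gate = torch.nn.Linear(8, 1)
attn = torch.softmax(gate(h), dim=0)   # [5, 1] per-atom weights
readout_attn = (attn * h).sum(dim=0)
```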

C.3. Effects of Dimension Size
The embedding size D has a significant impact on performance. As shown in Figure 1(c), we test the performance of MARS for D in {64, 128, 256, 512, 1024}. We find that the Top-1 accuracy is optimal when the embedding size is 512, and performance does not improve further at 1024; for larger values, the accuracy of MARS is not sensitive to D. This demonstrates that 512 dimensions are sufficient for our model to perform optimally.

C.4. Effects of the Number of GTN Layers
We show the effects of the number of GTN layers in Figure 2. The Top-1 and Top-3 accuracy of the model gradually increases as the number of GTN layers grows, while the Top-k accuracy stabilizes once the number of layers exceeds 3. Remarkably, the model achieves its best Top-1 performance with 6 GTN layers; increasing the depth to 7 induces a slight reduction in Top-1 accuracy due to overfitting. These observations highlight the efficacy of employing 6 GTN layers for superior representation learning of molecular graphs.

D. Visualization of Concatenating Multiple Motifs to Complete Synthons
We present three illustrative examples in Figure 3, each exemplifying synthon completion through the combination of multiple motifs. Notably, Example (a) represents a Top-2 prediction, demonstrating the accuracy and predictive capability of our model. Also worth noting are the two remaining examples, where the leaving groups are formed by combining motifs absent from the training set. This not only highlights the model's ability to predict novel reactants, but also serves as a compelling testament to the inherent flexibility of our proposed motif-based approach.

E.1. Molecular Graph Generation Methods
The field of molecular graph generation has seen various approaches (Madhawa et al., 2019; Zang and Wang, 2020; Luo et al., 2021) aimed at generating chemically valid molecules with specific chemical properties. MolGAN (De Cao and Kipf, 2018) generates molecules via generative adversarial networks. JT-VAE (Jin et al., 2018) first decomposes a molecular graph into disconnected subgraphs and then designs a junction tree variational autoencoder for molecule generation. Recently, autoregressive models have gained much attention in molecular graph generation. GCPN (You et al., 2018) formulates molecular graph generation as a Markov decision process. MolecularRNN (Popova et al., 2019) utilizes a recurrent neural network to generate the nodes and edges. GraphAF (Shi et al., 2020b) designs a flow-based autoregressive model to dynamically generate nodes and edges based on historical subgraph structures. Our proposed method can be considered a conditional molecular graph generation method based on an autoregressive model.

E.2. Template-free Retrosynthesis Prediction Methods
Template-free methods are data-driven methods that can be divided into sequence-based and graph-based methods. Sequence-based methods (Tetko et al., 2020; Wang et al., 2021; Mao et al., 2021) leverage natural language processing (NLP) techniques and treat the retrosynthesis task as a machine translation problem, with molecules represented as SMILES strings. For example, Seq2seq (Liu et al., 2017) and Transformer (Karpov et al., 2019) simply apply machine translation models to retrosynthesis tasks, resulting in the generation of invalid molecules. To remedy the grammatically incorrect output of these models, SCROP adds a syntax corrector to repair the output SMILES, achieving better performance than the vanilla Transformer. RetroPrime (Wang et al., 2021) uses two Transformers to translate products to synthons and synthons to reactants, respectively. Chemformer (Irwin et al., 2022) fine-tunes a chemical large language model to translate product SMILES into reactant SMILES. Although these methods simplify retrosynthesis modeling, most of them ignore the rich structural information in molecular graphs and are poorly interpretable. To address this problem, Retroformer (Wan et al., 2022) incorporates graphical information to predict the reaction centers and then translates synthons into reactants.
Graph-based methods (Shi et al., 2020a; Sacha et al., 2021; Han et al., 2022), on the other hand, model the retrosynthesis task in two steps: i) break the target molecule into incomplete molecules called synthons, and ii) complete them into reactants using subgraph units such as atoms or leaving groups. For instance, G2Gs (Shi et al., 2020a), RetroXpert (Yan et al., 2020), and GraphRetro (Somnath et al., 2021) build two independent models to implement these steps, respectively. MEGAN (Sacha et al., 2021) constructs an end-to-end graph generative model that completes synthons with individual atoms and benzene. Our work is closely related to graph-based models but fundamentally different from the above methods. First, rather than treating reaction center identification and synthon completion as two completely independent subtasks (Shi et al., 2020a; Yan et al., 2020; Somnath et al., 2021), our work integrates the two subtasks into an end-to-end framework. Second, compared to the high prediction complexity of completing synthons with small units (Sacha et al., 2021), adding motifs to synthons greatly reduces the length of the prediction sequence. It is worth noting that motifs are distinct from the leaving groups proposed by Somnath et al. (2021); the differences are discussed in subsection 3.2.

Table 2.
Atom features used in MARS. All features are one-hot encoded, except the atomic mass, which is a real number scaled to the same order of magnitude.

Table 3.
Bond features used in MARS.