Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2

Bridging systems biology and drug design, we propose a deep learning framework for de novo discovery of molecules tailored to bind with given protein targets. Our methodology is exemplified by the task of designing antiviral candidates to target SARS-CoV-2 related proteins. Crucially, our framework does not require fine-tuning for specific proteins but is demonstrated to generalize in proposing ligands with high predicted binding affinities against unseen targets. Coupling our framework with the automatic retrosynthesis prediction of IBM RXN for Chemistry , we demonstrate the feasibility of swift chemical synthesis of molecules with potential antiviral properties that were designed against a specific protein target. In particular, we synthesize an antiviral candidate


COVID-19 antivirals
The severe acute respiratory syndrome (SARS) coronavirus disease  is an acute respiratory disease caused by SARS-CoV-2 that, to date, has infected more than a hundred million humans and killed more than two million. Despite longstanding efforts to understand the pathogenicity of coronaviruses [1], no drugs were approved before the outbreak of the COVID-19 pandemic, and thus, new systematic approaches to identify effective antiviral agents are urgently needed. Current efforts are predominantly focused on drug repurposing strategies, and lots of effort was initially devoted to the clinical investigation of a handful of promising candidates, including remdesivir and hydroxychloroquine. Initial hopes regarding hydroxychloroquine have been diminished as it was not found effective in a meta-study of human clinical trials [2]. While evidence for remdesivir has been conflicting [3,4], it has been granted an emergency approval in North America, Asia and Europe [5]. Recently, protein-protein-interaction studies identified 69 promising compounds by measuring binding affinities of 26 out of the 29 SARS-CoV-2 proteins against human proteins [6].
With limited success in the drug repurposing strategies, it is worth exploiting de novo drug discovery approaches against SARS-CoV-2. Drug discovery is a daunting challenge, with a search space of 10 60 compounds [7],~10 years from design to market [8] and costs of up to $3 billion per new FDA-approved drug [9]. Given that so far there are only~1500 FDA-approved drugs on the market [10], while the number of already synthesized molecules is at least 60 million [11], the total attrition rate of drug discovery is above 99.99%. Additionally, it takes around 10 years until a typical compound reaches the market [8]. However, the availability of high-throughput screenings of compound-protein interactions (CPI) has enabled deep learning to set new benchmarks for large-scale QSAR prediction models for predicting protein-drug binding affinity [12]. Deep learning has further been proven capable of in silico design of molecules with desired chemical properties [13][14][15][16] and shown potential to accelerate the discovery of DDR1 inhibitors [17]. A few studies used deep generative models to release libraries of (unsynthesized) candidates to target 3C-like protease, a main therapeutic target of SARS-CoV-2 [18][19][20] but these studies manually curated datasets to design 3C-like protease inhibitors.

Our contribution
Here, we aim to bridge systems biology and drug discovery, and use deep learning to explore target-driven drug design with conditional generative models. Our framework (see (A) and (B)) for conditional molecular design builds upon our previous work, PaccMann RL [21]; however, note that here we focus on protein-driven instead of omics-profile-driven drug generation. Our framework can be trained to design compounds against any primary protein structure. Deep learning for target-driven drug design was first formulated in 2018 by Aumentado-Armstrong [22] and has since been investigated by others [23][24][25]. Similarly to Chenthamarakshan et al [25], our approach implements a conditional generator that can be applied to unseen protein targets. However, instead of using a conditional rejection sampling approach, we use deep reinforcement learning and conditionally generate molecules by fusing the latent spaces of protein targets and small molecules. Importantly, our approach is demonstrated to generalize to unseen targets -it can generate high fractions of potentially binding ligands for protein sequences it has never been exposed to. We further couple our model with IBM RXN, an AI-governed platform for retrosynthesis prediction [26,27], and demonstrate the entire cycle from design to synthesis of one of the most promising compounds.
We emphasize that our framework is generic in its applicability for protein-target driven molecular design and synthesis. To showcase the methodology, we focus on on designing novel antiviral candidates against SARS-CoV-2-related virus and host proteins.
Notably, the main contribution of our work is not claiming to find the most promising SARS-CoV-2 antiviral candidates, but rather to present new technology, namely how a generic, machine-learning-driven, fully automatic pipeline for design and chemical synthesis may potentially accelerate the discovery of hit-like molecules.

Problem formulation
The goal is to develop a conditional generative model G Θ that can be queried with a protein and returns novel ligands that have high binding affinities to the target and, as a secondary objective, low toxicity. Let P denote the space of proteins and M the molecular space. We are interested to learn a mapping G Θ : P → M s.t. the reward R(p, m) with p ∈ P, m ∈ M is maximal.

Predictive models R(·)
is a multimodal reward function depending on p and m. Specifically, let with Aff(·) as the target-compound binding affinity of p and m, Tox(·) as the toxicity of m and γ ∈ R + as a user-defined parameter to control the importance of toxicity (we used γ = 0.5). Since the exact computation of Aff and Tox are intractable to compute in-silico (as they require in vitro experiments) they are approximated using Φ Tox and Φ Aff . Precisely, Φ Tox : M →ŷ Tox maps a molecule to a toxicity vector of 12 toxicity classes where 1 means toxic and 0 non-toxic. A positive reward is given if and only if the molecule is not predicted active in any of the 12 toxicity classes. Furthermore Φ Aff : P × M → [0, 1] maps a protein-ligand tuple to a probability that the ligand binds to the protein. For details on this model and the data (e.g. the threshold for binding strength) see subsection 2.2.2 and S1.2 (available online at stacks.iop.org/MLST/2/025024/mmedia).

Generative models
The two predictive models for binding affinity and toxicity are employed as reward functions for the conditional generator, which is modelled as follows.

Pretraining
First, two separate models Θ Prot : P → P and Θ Mol : M → M are devised. The parameters of both models can be optimized without supervision using training data of unlabelled proteins and molecules, respectively. These models implicitly define a mapping from discrete structures (proteins and molecules) to continuous representations. Θ Prot and Θ Mol are independent generative models that are parameterized with VAEs [28] so that both can be represented as a composition of encoder/decoder modules: The two models are tuned to optimize the ELBO: i.e. the latent code is modelled using a multivariate unit Gaussian following standard VAE formulation [28]. After training, the generative model can be utilized by sampling from p(z) and applying Θ Dec Mol : Z → M which constitutes our baseline for molecule generation.

Conditional generation
After the above models are trained, the conditional generative model is defined by G Θ : [Θ Dec Mol • Θ Enc Prot ] : P → Z → M. In other words, the conditional generator is obtained by encoding a protein with the protein VAE and decoding the latent code with the molecular decoder. This means, a molecule m is obtained from a protein p by: m = G Θ (p) = Θ Dec Mol (Θ Enc Prot (p)). The final training objective function of this hybrid VAE (G Θ ) is where P Θ (m|p) indicates the conditional probability approximated by G Θ . Since equation (2) is intractable to compute, it is approximated using policy gradient and subject to maximization using REINFORCE [29], as proposed in ReLeaSE [15]. Notably, the conditional generation is not limited to binary classifiers, as R(·) can be instantiated with any function. Figures 1(A) and (B) summarize the structure of the proposed system.  [30]. This dataset consists of N Tox = 11 765 training molecules that are screened against 12 toxicity assays of nuclear receptor and stress response pathways, and labelled with binary values for each class (toxic vs non-toxic). The test dataset consists of 648 molecules (data split is provided).

Model
The toxicity prediction model Φ Tox : M →ŷ Tox was implemented through a multiscale convolutional attention model that ingested augmented SMILES sequences [31,32] to predict the 12 toxicity endpoints.
For details see S1.1. The purpose of this in silico screening was to assess the toxicity of parent molecules and their metabolites. Our work on toxicity prediction is further detailed in [33].

Protein-ligand affinity prediction 2.2.2.1. Data
Drug-protein binding affinity data is obtained from BindingDB [34], a public database of measured binding affinities between proteins and small drug-like compounds. After data processing (details in S1.2) 1361 076 entries (7302 protein targets, 772 634 compounds) were taken as positive binding examples. Negative samples, on the other hand, were obtained by randomly assigning 187 compounds to each target from the list of compounds not reported to bind to that particular target yielding a total of 2273 726 samples. For reasons on the choice for classification instead of regression see S1.2.

Model
To predict CPI, we model Φ Aff : P × M → [0, 1] with a bimodal neural network based on the multiscale convolutional attention model (MCA [35,36]). This model ingests a molecule m ∈ M represented as a SMILES sequence and a protein p ∈ P represented as an amino acid sequence, applies convolutions (to aggregate local information) and a contextual attention mechanism (to focus on relevant substructures), before a continuous prediction y ∈ [0, 1] is obtained that is interpreted as probability that a binding occurs. For model details see S1.2.

Protein VAE 2.2.3.1. Data
For learning the protein space P we utilize 404 552 proteins from UniProt [37]. The proteins considered were selected by filtering out sequences longer than 8190 amino acids. We choose our training data to be vectorial representations of 768 dimensions rather than primary structures), specifically embeddings generated by a BERT [38] language model released in TAPE [39]. TAPE adapted self-supervised natural language models like BERT to learn continuous representations of proteins on large scale. The amino acid embeddings were pretrained on 32.6 M Pfam sequences [40] following the original BERT architecture and learned via masked-token prediction [39]. The TAPE model contains 110 M parameters and we exclusively used it for feature generation to obtain the training data for the protein VAE. To accommodate the occasional unusually long protein sequences, maximum sequence length was increased from the original 512 to 8192 tokens. In this work, we used the pretrained amino acid features from TAPE 'as is' by averaging them, i.e. without fine-tuning them for a downstream classification task. For conditional generation, however, we used the pooled instead of the averaged embeddings, without observing a notable difference in performance. The approach to use TAPE was pursued to circumvent the need to train a large-scale protein language model from scratch.

Model
As stated above, we model Θ Prot : P → P using a VAE [28] with three dense layers of sizes [768,512,256] with ReLU activation and batch normalization in both encoder and decoder. During training, we use KL annealing [41], dropout (p = 0.2), a learning rate of 3 × 10 −3 and optimize RMSE.

SELFIES VAE 2.2.4.1. Data
For learning the molecular space M we utilize 1576 904 bioactive compounds from ChEMBL. Ten percent are held out as validation set. We choose our training data m ∈ M not to be SMILES sequences (like for the toxicity and affinity predictor), but rather SELFIES strings [42], a robust adaption of SMILES that was devised for generative models and solves the validity problem in SMILES generation.

Model
The model for the chemical space Θ Mol : M → M is implemented using a VAE, following Born et al [21], i.e. it consists of two layers of stack-augmented GRUs [15] in both encoder and decoder. For details see S1.3. Notably, after Θ Mol is pretrained, we sample from p(z) and apply the decoder Θ Dec Mol : Z → M as described above to constitute our baseline for molecular generation.

Conditional generation 2.2.5.1. Data
We retrieved 41 SARS-CoV-2 related protein targets as labelled in UniProt (as on 22 May 2020) 5 . A full list of targets is given in table 2 and includes among others the 3C-like protease (M pro ), a promising candidate for antiviral compound development [43] that has already been investigated with generative models [18] and molecular docking studies [44]. Other included proteins are the nucleocapsid (N-) protein and the spike glycoprotein. The latter is a surface protein, which mediates entrance to human respiratory epithelial cells by interacting with the ACE2 receptor and is the target of chloroquine. Notably, the targets used for conditional generation do not need to be present in the training data of the affinity predictor or the protein VAE. Hence, we can generate potential ligands for proteins without the need for any binding data, a key feature to combat new diseases. Here, 9/41 SARS-CoV-2 related proteins are present in in the training data of the affinity predictor and 27/41 are present in the training data of the protein VAE.

Model
The conditional generator G Θ consists of the pretrained protein encoder and the pretrained molecular decoder, s.t. G Θ : [Θ Dec Mol • Θ Enc Prot ] and learns a mapping P → Z → M. In practice, we set γ = 0.5 in our multi-objective reward function since optimizing toxicity was a secondary objective of this work compared to optimizing binding affinity for a target protein. Moreover, we perform a leave-one-out cross-validation (LooCV) during conditional generation (i.e. N CG = 40) and evaluate the model by the fraction of generated ligands with high predicted affinity and low toxicity (Θ Aff (p, m) > 0.5 and Θ Tox (m) = 0).

AI-assisted synthesis of a candidate compound 2.3.1. Retrosynthesis prediction
First, we predicted the best synthetic routes using the interface of IBM RXN 6 , specifically, a transformer-based retrosynthesis prediction engine [26]. The prediction is built upon sequence-based models that operate on SMILES strings, specifically two molecular transformers [45]; one for forward reaction prediction (reactants to product) and one for backward reaction prediction (product to reactants). The two models are combined to explore the retrosynthesis hypertree, using a beam search based on the forward model confidence scores, to find the most likely routes to synthesize a molecule given a set of commercially available chemicals. Given a SMILES string, this model predicts a reaction tree composed of starting materials and intermediates (all represented as SMILES) as nodes and reaction types as edges.
Next, we used a transformer model [46] to predict the optimal synthesis protocol using a text representation of the predicted synthesis steps. The architecture is trained to predict a sequence of actions given a reaction encoded as SMARTS. The synthesis action generation model is trained with a large number of chemical recipes extracted from a corpus of organic chemistry procedures [27]. For details on the methodology see [26,27,45,46].

Chemical synthesis
In the following, the predicted procedure is described in detail. In a glass reactor of 100 ml, 9 ml of anhydrous THF were added at room temperature. Nine milliliter of a solution of 3-bromo-benzonitrile in THF (0.11 M, 1 mmol) were added under gentle (100 rpm) stirring. While maintaining the temperature of the reactor at 25 • C, 1 ml of LiAlH4 in THF (2 M concentration) was added dropwise across 180 s. The reaction mixture was stirred for 5 min at 25 • C and then the excess of LiAlH4 quenched with 2 ml saturated NaCl aqueous solution and stirred for 60 s. Although the brine was not provided in the predicted synthetic route, it was favored over water or an alcohol since it prevents the formation of a colloidal dispersion of aluminum hydroxide. The organic layer was collected and further analysed. 0.3 ml of the organic layer was diluted 50 times and then analyzed with an LC/MS (Agilent TOF6230). The spectrum is reported in supplementary figure S7.

Protein-ligand affinity prediction
The results of the binding affinity prediction model on validation and test data from BindingDB are displayed in table 1. The results show that the model learned reasonably well to classify compound-protein-interaction samples as binding (positive class) or non-binding (negative class). Because the conditional generation focuses on antiviral drug design, we had to ensure that the affinity predictor generalizes well for viral proteins. We therefore additionally measured model performance on~10k held-back samples from viral proteins and find reasonable generalization to viral proteins.

Toxicity predictor
Because toxicity is a major cause of the high attrition rate in drug discovery, we decided to perform a multi-objective optimization based on toxicity and binding affinity. Across ten runs on the Tox21 dataset, this model achieved a ROC-AUC of 0.877 ± 0.04, in predicting toxic vs non-toxic in the 12 Tox21 assays (ROC-AUC is obtained by concatenating the predictions for each of the 12 classes) surpassing prior results on this benchmarked dataset. Both the affinity and toxicity predictor are not investigated further herein, but employed as reward function for the conditional generation. For details see [33].

Conditional generative model 3.3.1. Workflow of RL optimization
The following procedure of conditional generative design is similar to the one described in [21] with the difference that here, the biomolecular context is a protein target instead of an omic profile. During the deep RL optimization, the agent G Θ , a hybrid-VAE receives as input a SARS-CoV-2-related protein, such as 3C-like proteases. First, the protein p ∈ P is encoded into the latent space of proteins using Θ Enc Prot : P → Z. Next the molecular decoder generates a compound using Θ Dec Mol : Z → M. This is a valid operation due to the variational constraint in the bottleneck layer of both VAEs, as imposed by the Kullback-Leibler (KL) divergence in the ELBO of Θ Mol and Θ Prot . In other words, the independent pre-training of the protein VAE as well as the SELFIES VAE steered both unimodal models to encode their data points (i.e. proteins and molecules) as samples of a multivariate Gaussian distribution. This observation is key for the construction of our hybrid-VAE consisting of a protein-encoder and a molecular decoder. Therefore, during the RL optimization, initial decodings will thus still be valid molecules. Finally, the compound-protein pair is evaluated using the reward function R : M × P → R. Over time, the stochastic optimization led to generating more compounds with high predicted binding affinities, as shown in table 2.

Results of LooCV
In this study, we aim to validate whether our conditional generative framework can go beyond current approaches for target-driven compound design [17,18,25] in the sense that it does not require explicit optimization for a specific target. We therefore investigated the generalization capabilities of our framework by performing a leave-one-out-cross-validation (LooCV) on the 41 targets. Prior to starting the RL optimizatiton, we sampled 3000 molecules from the pre-trained SELFIES VAE and predicted binding affinities (Affinity 0 ) and toxicity scores (Tox 0 ) for all those molecules. This constitutes our baseline for later Table 2. Generating antiviral compounds against unseen SARS-CoV-2 targets. For each of the 41 targets, Affinity0 shows the fraction of binding molecules sampled before training, Aff best at the best epoch of RL training, and Aff median the median across all five training epochs. The same applies to Tox best and Tox med Note that Tox0 is independent of the protein and was 8.7%. The best model is highlighted in bold for each target protein SEM abbreviates standard error of the mean.

Target protein
Affinity0 comparison. Next, the RL optimizaton was performed for 5 epochs and 500 molecules were sampled in each epoch. In table 2 we report the percentage of molecules predicted to bind and to be toxic for each of the 41 targets obtained by 41 runs in a LooCV. The results demonstrate that in 35 out of 41 cases the model proposed more binding compounds against an unseen target, compared to the baseline SELFIES VAE. The average ratio of compounds predicted to bind increased from 18% to 26% with the best epoch averaging 33% across all targets. Example density plots for 2 out of the 41 individual optimizations are shown in figure 2. We additionally optimized the generator to propose less toxic compounds. This succeeded to a lesser extent, probably at least partially caused by the lower weight in the reward function. For a qualitative evaluation, figure 3 shows a selection of the sampled molecules alongside their QED score [47].

Learned chemical space
Knowing that text-based deep molecular generative models can memorize large fractions of the chemical space [48], we sought to investigate the learned chemical space, assembled by a dataset of 10 000 random ChEMBL compounds, 3000 molecules sampled from the unbiased VAE, 3000 molecules sampled during the  RL optimisation and 82 SARS-CoV-2 candidate drugs from the literature (top 15 matches on PubChem and 69 compounds identified via protein-interaction-maps [6], excluding two duplicates). For all these molecules, binding affinities were computed alongside other pharmacological properties. Next, a UMAP [49] was performed on the ECFP4 fingerprints, a well-established index for molecular featurization, of all molecules [50] and visualized with Faerun/Tmap [51,52]. The interactive visualisation (available online 7 ) shows that the RL optimisation concentrates the compound sampling on a manifold of the chemical space that is more densely populated with binding compounds (for a snapshot see supplementary figure S4). The 3D UMAP shows that the currently investigated candidate molecules (red) are structurally fairly dissimilar, i.e. widely scattered across the chemical space. But it gives evidence that our model successfully navigates the chemical space towards regions of high reward. While this shows that the generator succeeded in its objective of generating more ligands with high affinities we are aware that the quality of the reward function remains a bottleneck of the framework.

Target selectivity
Small molecules designed to bind to a specific target can be expected to bind to at least a dozen different targets [53]. This promiscuity of ligands is an important challenge in targeted drug design as it can induce side effects such as off-target cytotoxicity or lowered efficacy. Accounting for target selectivity has been proposed as a means to reduce attrition rates in downstream clinical trials [54]. To assess the selectivity of the generated molecules we computed a promiscuity score P m,T for each molecule m and a set of targets T by measuring the fraction of targets to which the molecule is predicted to bind. The same set of molecules as described above were used and the results are shown in figure 4. Interestingly, the promiscuity of the molecules generated during RL optimization was significantly lower than the promiscuity of ChEMBL molecules as well as molecules from the unbiased VAE (Tukey's HSD test, p < 0.001 in all pairwise differences, mean for ChEMBL: 0.19, mean for Optimized: 0.11, see figure 4(A). Note that this was achieved without explicitly penalizing high promiscuity to unrelated targets. Additionally, investigating promiscuity within the set of 41 examined SARS-CoV-2 related targets reveals that promiscuity is significantly higher for the optimized molecules compared to the other two sets (Tukey's HSD test, p < 0.001 in all pairwise differences, mean for ChEMBL: 0.12, mean for Optimized: 0.27, see figure 4(B). While further work to improve selectivity can certainly be done, these results indicate that our conditionally generated molecules are not only less prone to off-target binding effects but also more likely to bind to related SARS-CoV-2 targets than the other sets of molecules.

Case study
For a more detailed assessment of the quality of the molecules, we ranked all~3000 conditionally generated molecules by their Tanimoto similarity τ (F 1 , F 2 ) = |F1∩F2| |F1∪F2| (where F 1 and F 2 are binary fingerprints of molecules m 1 , m 2 ∈ M) to the closest neighbour of the 82 literature candidates. Among the top five molecules, we found the molecule encircled in figure 3 generated against VEMP SARS2 (UniProt ID: P0DTC4), the envelope small membrane protein (E-protein), a key player for virion assembly and morphogenesis. From all 82 literature candidates, our molecule exhibits the highest Tanimoto similarity to the compounds MZ1 and dBET6 (τ = 0.64 based on RDKit fingerprint). Notably, these two pre-clinical SARS-CoV-2 drug candidates were identified by Gordon et al [6] as targeting the E-protein, exactly the protein which was used to condition the generation. MZ1 and dBET6 target E-Protein by degrading the human BRD2 and BRD4 proteins and thus preventing the virus from inducing changes in the host's protein expression.

Retrosynthesis
The top-5 candidate compounds for each protein target were further analyzed for synthetic feasibility using IBM RXN's retrosynthesis engine [26]. We performed the predictions using the Python package rxn4chemistry 8 . We used the default settings for the hypertree exploration: limiting the maximum number of synthesis steps to 6 (to obtain a reasonable yield), applying a confidence threshold on the forward model score of 0.6 (FAP, forward acceptance probability) and setting the number of beams for the search to 10 (more details can found in the documentation 9 ).
Although the generated molecules are not optimized for synthetic accessibility, more than half of the >2000 predicted synthesis routes are feasible. Overall, for 29% of the top-5 candidates per target, a synthetic route could be successfully predicted with at most six synthesis steps. Interestingly, almost half of the successfully predicted molecules only require a single or two step reactions, indicating that many of the generated molecules can be synthesized from commercially available materials in a few steps. Moreover, a correlation analysis between chemical and pharmacological properties indicates that some properties like QED and synthetic feasibility are highly correlated (for details see supplementary figure S5).

Selection of synthesis candidates 3.6.1. Selection of ACE2 target
The selected target for the first synthesis was ACE2, a host protein that is widely regarded a promising target for SARS-CoV-2 antiviral drug design [55][56][57] and was even argued a priority role [58]. Despite this encouraging evidence in favour of ACE2, studies using generative models against SARS-CoV-2 host targets are almost absent; with one exception [59]. For details on docking and drug repurposing studies on ACE2 for COVID-19 see S1.6. We aim to fill this gap and here exemplify the process of generating and synthesizing ligands predicted to bind to host targets.

Role of ACE2
ACE2 is a type 1 membrane protein that regulates the renin-angiotensin-aldosterone system (RAAS) and is predominantly expressed in lung alveolar epithelial and endothelial cells [60][61][62]. It plays an important role in regulating cardiovascular homeostasis [63,64], inhibition of cell growth, and protection from alveolar epithelial cell injury [65][66][67]. ACE2 has previously been identified as a functional receptor for SARS-CoV to mediate cell entry by its spike protein [68,69]. SARS-CoV-2 also utilizes ACE2 as a receptor [70,71], specifically by a fusion of ACE2 with the densely glycosylated spike (S) proteins [70], but with an increased binding affinity due to modifications around the centre of the binding domain [72,73]. Owing to the importance of the S-protein in viral cell entry and fusion, developing ACE2 inhibiting drugs seems like a logical step in tackling SARS-CoV-2. The plausibility of this approach was highlighted using a recombinant RBD protein to prevent SARS-CoV-2's RBD from binding to ACE2's peptidase domain [74,75]. With strong evidence of the role played by ACE2 in viral infection, much of the focus is on finding drugs that can prevent the S-protein from interacting with this enzyme.

Selection of molecule
As a demonstration of the fully autonomous pipeline for swift discovery and synthesis of molecules generated against unseen SARS-CoV-2-related protein targets, we synthesized 3-Bromobenzylamine, a molecule proposed by our generative model against ACE2. Note that on rare occasions, our generative model proposes molecules that already exist in chemical databases and as such are not de-novo. This was the case for 3-Bromobenzylamine, a bioactive compound which has been crystalized with the histidyl-RNA synthetase of T. cruzi in PDB 4YRI and is known to inhibit neurological enzymes like PNMT [76].
3-Bromobenzylamine was selected anyway for several other reasons. First, a maximum common subgraph similarity search [77] within the set of 82 COVID-19 literature candidates revealed that 3-Bromobenzylamine is a full substructure of Arbidol (Umifenovir), a broad-spectrum antiviral drug used in Asia against influenza and hepatitis [78,79]. Notably, Arbidol was proposed as an antiviral drug for COVID-19 specifically for its interaction with the ACE2 receptor [80], exactly the target against which 3-Bromobenzylamine was generated. It was first hypothesized that Arbidol may act as a virus host fusion inhibitor for SARS-CoV-2 [80,81] (thus preventing viral entry to the target cell, just like in influenza and hepatitis viruses [82,83]) and later experimentally confirmed in docking studies [84,85] as well as in-vitro [86]. An antiviral effect of Arbidol on SARS-CoV has been known for many years [87] and while COVID-19 studies are not yet fully conclusive [88], Arbidol was found effective in decreasing mortality [89] and in increasing negative rate of PCR [90][91][92]. We therefore hypothesized that 3-Bromobenzylamine as a smaller and broadly available compound, could operate in a similar mechanism of action by interacting with ACE2, especially since the presence of bromine was found important for the efficacy of Arbidol [93].
Secondly, the biochemical properties predicted in-silico were desirable, with a predicted ACE2 affinity of 0.77, a high drug-likeness (QED = 0.71), a penalized LogP of 0.25, the presence of an aromatic ring, an estimated solubility (ESOL [94]) of −2.66, a relatively low promiscuity to the remaining protein targets (0.13) and a relatively high promiscuity for the other SARS-CoV-2 related targets (0.27). The molecule has a molecular weight of 186 Dalton, passes the Lipinski rule of five and was predicted to be non-toxic in all but one Tox21 assays (NR-AR-LBD, the androgen receptor ligand-binding domain). Thirdly, the retrosynthetic route (shown in supplementary figure S6) was comparably simple (one reaction, eight reactants) and predicted with high confidence (98.5%) by the retrosynthesis model [26]. Although the reaction itself, a nitrile reduction, reducing 3-Bromobenzonitrile with lithium aluminium hydride, is challenging, it is a known, well-understood reaction that minimizes the risk of complication thus making the target compound the ideal candidate to demonstrate the validity of the end-to-end concept.

Chemical synthesis
The LC/MS shows a clear signal related to the presence of the 3-Bromobenzylamine with a score of more than 99%. No signals connected to the precursors were identified by the target screening analysis. Although it is not possible to use the qualitative analysis for quantitative arguments, the lack of evidences pointing to the presence of the precursor in the LC/MS analysis corroborates the indications that the automatically suggested synthesis route successfully completed in a quantitative way (report in supplementary figure S7).

Summary
Here, we proposed a novel framework for compound design that can be targeted towards any target protein without retraining requirements. We showcased the potential of our generative framework by tackling the problem of designing novel antiviral candidates with high binding affinity to unseen SARS-CoV-2 related targets, while controlling toxicity of the generated molecules. Without explicitly accounting for target selectivity, the generated molecules exhibited comparably low promiscuity within a random set of protein targets but comparably high promiscuity within the set of SARS-CoV-2 related targets. A future endeavour could be to optimize scaffolds of existing drugs against specific targets by coupling our approach with the recently proposed deep scaffold generator [95] or to directly optimize binding scores. Furthermore, to assess the feasibility of synthesizing the generated compounds, we estimated the retrosynthetic pathways of a subset of candidates for each target. We exemplified how our framework for target-driven de novo discovery of molecules with potential antiviral properties can be coupled with an AI-assisted synthesis platform for swift chemical synthesis. One molecule was selected for further testing and successfully synthesized using an automatically derived synthesis route. This molecule, 3-Bromobenzylamine, is a substructure of Arbidol, a broad-spectrum antiviral drug [79] with known efficacy in combatting COVID-19 [89]. We are aware that our synthesized molecule is commercially available, relatively small, and that the true biochemical activity remains unclear at this point. However, the ultimate goal of our contribution is to showcase a generic pipeline for rapid generation and chemical synthesis of molecules with desired properties, in this case, potential binding to SARS-CoV-2 related target proteins.

Limitations
We are aware that the true bioactivity of the proposed molecules can only be consolidated by in vitro and in vivo experiments. Another bottleneck in our pipeline is the accuracy of the predictive models such as the affinity predictor, which could be further improved using recently available data from a large-scale screening consisting of 1670 compounds tested against SARS-CoV-2 proteins [96]. On the one hand, it is advantageous that our model operates only on protein primary structures and thus does not necessitate the availability of 3D protein structure, a key limitation especially in situations of new infectious diseases such as COVID-19 where it takes months before tertiary structure of key proteins becomes available [97]. On the other hand, docking simulations could be beneficial to collect further evidence that the molecules indeed bind to the target structure prior to synthesis, particularly for target-site specific modeling which has not been tackled directly in here. It is therefore conceivable that some of the generated ligands bind to unintended drug binding pockets that do not or only mildly allow the drug to exert its mechanism of action. Predicting binding affinities for specific sites and conditioning the generative model on specific pockets can in principle be achieved given the pocket is well defined in terms of primary structure. However, related studies with deep generative models on SARS-CoV-2 have shown that primary structure can suffice to generate molecules with favorable binding free energies that reliably identify the druggable binding pocket within the 3D protein structure [25].

Conclusion
A key advantage of our proposed method is that it can generate ligands with high predicted binding affinities for entirely novel protein sequences, as we found in the leave-one-out-cross-validation on 41 SARS-CoV-2 related proteins. To the best of our knowledge, this is the first report of a conditional molecular generator that generalizes to unseen protein sequences. Another key novelty of the proposed system is the integration of molecular generative models with retrosynthesis and synthesis protocol prediction models. This enables the generation of de novo molecules and the compilation of the synthesis's experimental procedures without human intervention. Conclusively, we hope that our framework can be a building block towards the swift discovery of de novo molecules with desired pharmacological properties.

Code & data availability
The data that support the findings of this study are openly available at: https://ibm.ent.box.com/v/ paccmann-sarscov2. Under the link, we provide processed versions of all datasets used in the study as well as pretrained models. The following datasets were used: ChEMBL [98]: SELFIES VAE; UniProt [37]: Protein VAE; BindingDB [34]: Protein-ligand affinity prediction; Tox21 [30]: toxicity prediction; SARS-CoV-2 targets [6]: conditional generation. All models are implemented in PyTorch 1.3.1 and training was done on POWER8 processors and an NVIDIA Tesla P100. The source code for training the models presented in this submission is publicly available at: https://github.com/PaccMann/paccmann_sarscov2.