Accelerating Inhibitor Discovery for Multiple SARS-CoV-2 Targets with a Single, Sequence-Guided Deep Generative Framework

These authors contributed equally to this work.

Abstract
The COVID-19 pandemic has highlighted the urgency for developing more efficient molecular discovery pathways. As exhaustive exploration of the vast chemical space is infeasible, discovering novel inhibitor molecules for emerging drug-target proteins is challenging, particularly for targets with unknown structure or ligands. We demonstrate the broad utility of a single deep generative framework toward discovering novel drug-like inhibitor molecules against two distinct SARS-CoV-2 targets: the main protease (Mpro) and the receptor binding domain (RBD) of the spike protein. To perform target-aware design, the framework employs a target sequence-conditioned sampling of novel molecules from a generative model. Micromolar-level in vitro inhibition was observed for two candidates (out of four synthesized) for each target. The most potent spike RBD inhibitor also emerged as a rare non-covalent antiviral with broad-spectrum activity against several SARS-CoV-2 variants in live virus neutralization assays. These results show that a broadly deployable machine intelligence framework can accelerate hit discovery across different emerging drug targets.


Introduction
De novo molecular design, the proposal of novel compounds with desired properties, is a challenging problem with applications in drug discovery and materials engineering. For instance, a key objective in the drug discovery workflow is to identify candidate molecules, known as hits, that can interact with and inhibit a known drug-target protein with measurable activity. Searching for hit compounds that serve as the chemical starting points for further design of drug candidates typically involves high-throughput screening of libraries containing standard chemical compounds or smaller chemical fragments. Success rates for this method of hit discovery are between 0.5 and 1 percent [1], depending upon the size of the library screened (typically on the order of 10^4 entries) and target characteristics. This low success rate is in part due to the immense search space, now estimated to span between 10^33 and 10^80 feasible molecules [2], of which only a minute fraction typically possesses the traits sought. Exhaustive enumeration of this vast chemical space is infeasible. Consequently, the cost of developing a single new drug is high, reaching up to $2.8 billion, while the duration from concept to market typically exceeds a decade [3].
arXiv:2204.09042v1 [q-bio.QM] 19 Apr 2022
In addition to the need for thousands of screening experiments, the initial selection of the library frequently requires detailed structural information on the target protein of interest, which is often not readily available. Further, discovery is often performed using hand-crafted rules and heuristics to link existing fragments and/or to avoid impractical synthetic pathways. Therefore, a more efficient approach is urgently needed to distill novel and promising molecules from the vast chemical space. Such an approach would allow experimental validation of a small selection of candidates, resulting in a higher hit discovery rate at reduced time and cost. Deep learning-based generative models have the potential to enable discovery of novel molecules with desired functionality in a "rule-free" manner, as they aim to first learn a dense, continuous representation (hereafter referred to as a latent vector) of known chemicals and then modify the latent vectors to decode into new molecules. Such models thus offer access to previously unexplored chemical space unrestricted by conscious human bias. However, for the task of target-specific inhibitor design, an "inverse molecular design" [4] approach must be utilized, where the navigation through the learned chemical representation is guided by molecular property attributes, such as target inhibition activity. Designing inhibitors for a new target in this way requires a sufficient number of exemplar molecules, which are likely unavailable and require costly and time-consuming screening experiments to obtain. As the majority of existing deep generative frameworks (see Sousa, et al. [5] for a review of generative deep learning for targeted molecule design) still rely on learning from target-specific libraries of binder compounds, they limit exploration beyond a fixed library of known and monolithic molecules, while preventing generalization of the machine learning framework toward more novel targets. As a result, while some studies [6-8] that use deep generative models for target-specific inhibitor design have been experimentally validated, rarely have such models demonstrated sufficient versatility to be broadly deployable across dissimilar protein targets without having access to detailed target-specific prior knowledge (e.g., target structure or binder library).
Our work demonstrates the real-world applicability of a single, deep-generative inhibitor design framework across different target proteins, while only requiring more readily available target sequence information to guide the design. Here, we employ a recently published generative deep learning model, CogMol [9], to propose novel and chemically viable inhibitor designs for two important and distinct SARS-CoV-2 targets: the main protease (Mpro) and the receptor binding domain (RBD) of the spike (S) protein. In this study, the deep generative framework, pre-trained on large-scale chemical molecules, protein sequences, and protein-ligand binding data, serves as a foundation for inhibitor molecule design, as it can be extrapolated to new target sequences not present in the original training data. A set of novel molecules targeting SARS-CoV-2 proteins, which was designed by CogMol, was shared under the Creative Commons license in April 2020 (updated in May 2020) in the IBM COVID-19 Molecule Explorer platform (footnote 1). Here, we provide experimental validation of the broad utility and readiness of the CogMol deep generative framework by synthesizing and testing the inhibitory activity of a number of prioritized designs against SARS-CoV-2 Mpro and the RBD of the spike protein. We further demonstrate the applicability of the binding affinity predictor model used in the CogMol framework by applying it to virtual screening of a library of lead-like chemicals and successfully identifying three compounds that were ultimately confirmed by crystallographic analysis to bind at the active site of Mpro, one of which showed micromolar inhibition.
To our knowledge, the present study provides the first validated demonstration of a single generative machine intelligence framework that can propose novel and promising inhibitors for different drug-target proteins with a high success rate, while only using protein sequence information during design. The demonstrated broad-spectrum antiviral activity of the designed spike inhibitor against the SARS-CoV-2 variants of concern further establishes the potential of such a deep generative framework to accelerate and automate the hit discovery cycle, a process known to suffer from low yield and high attrition rates.

Attribute-conditioned molecule generation with a deep generative model
The overall inhibitor discovery pipeline is described in Figure 1 and consists of three main steps: (a-c) candidate design in a target-conditioned manner using the deep generative model framework, (d) in silico screening for candidate prioritization, and (e) wet lab validation of prioritized molecules. For de novo molecule design, we used the deep generative framework CogMol as a foundation, which enables the design of inhibitor molecules for different targets without requiring training or fine-tuning of the model on target-specific data. Hereafter, we refer to machine-designed novel compounds as de novo compounds.
CogMol works as follows: first, it uses a variational autoencoder (VAE) [10], a popular class of generative models, as the base generative model (Figure 1a). A VAE comprises an encoder-decoder pair. The encoder neural network maps the simplified molecular-input line-entry system (SMILES) [11] string of a molecule into a low-dimensional representation. We denote the encoder as q_φ(z|x), where z is a latent encoding of the input SMILES x and φ represents the encoder parameters. The decoder p_θ(x|z), which is also a neural network, then converts the latent vector z back into the reconstructed SMILES x. The encoder in a VAE is probabilistic in nature, as it outputs latent encodings that are consistent with a Gaussian distribution. The decoder is therefore stochastic: it samples from the latent distribution to produce an output x. The encoder-decoder pair is trained end-to-end to optimize two objectives simultaneously. The first objective minimizes a loss term to ensure accurate reconstruction of an input SMILES from the corresponding latent embedding. The second objective is a regularization term that constrains the latent encodings to a standard normal distribution. The resulting latent space is continuous, enabling smooth interpolation as well as random sampling of new molecules from the latent space. To learn meaningful latent molecular representations that capture general knowledge about diverse chemicals, the VAE in CogMol is trained on more than one and a half million small molecules from public databases.
[Figure 1 caption, partially recovered: ... properties, docking score to target structure, and predicted retrosynthetic feasibility and toxicity. (e) A small set of prioritized molecules is synthesised, followed by wet lab testing in specific in vitro assays to confirm target inhibition. (f) In the present case, for each target, of the four molecules tested, two showed promising levels of inhibition. We also report approximate sample sizes and timeline for each stage of our discovery workflow. Note that the timeline does not include the training and testing of the generative and predictive machine learning models.]
Once the chemical latent representation is learned, CogMol performs attribute-conditioned sampling on that representation to generate entirely new molecular entities with properties biased toward the design specifications. Specifically, the goal is to design novel drug-like molecules with a high binding affinity to the target protein of interest. Two z-based property predictors are used: a drug-likeness (QED) predictor and a target-molecule binding (strong/weak) predictor. Both predictors use the z encodings of molecules as input. For the binding predictor, the protein sequence embeddings from a pre-existing deep neural net [12] were concatenated with the molecular latent encodings, and the predictor was trained on the general protein-molecule binding affinity data available in the BindingDB database (Figure 1b). Performance of the attribute predictors is reported in the Model and Methods section.
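The concatenation step of the binding predictor can be illustrated with a minimal NumPy sketch. This is not the CogMol implementation: the function names, the embedding sizes (64 for the molecular latent, 128 for the protein embedding), and the random weights are assumptions made purely for illustration, and a real predictor would be trained on binding data.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(z_dim, prot_dim, hidden_dim, seed=0):
    """Random weights for illustration only; a real predictor would be trained."""
    rng = np.random.default_rng(seed)
    in_dim = z_dim + prot_dim
    return {
        "W1": rng.normal(0.0, 0.05, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.05, size=(hidden_dim, 1)),
        "b2": np.zeros(1),
    }

def binding_probability(z, prot_emb, params):
    """Return P(strong binder) for each (molecule latent, protein embedding) pair."""
    x = np.concatenate([z, prot_emb], axis=-1)  # join the two representations
    h = relu(x @ params["W1"] + params["b1"])   # one hidden layer with ReLU
    logit = h @ params["W2"] + params["b2"]
    return sigmoid(logit).squeeze(-1)

# Hypothetical usage: four molecule latents scored against one target sequence.
rng = np.random.default_rng(1)
z_batch = rng.normal(size=(4, 64))
target_emb = np.tile(rng.normal(size=(1, 128)), (4, 1))
scores = binding_probability(z_batch, target_emb, init_params(64, 128, 2048))
```

Because the target enters only through its sequence embedding, the same predictor can in principle be queried with a new target by swapping `target_emb`, which is the property the sequence-conditioned design relies on.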
Given a target protein sequence of interest, those two predictors are used together to sample molecules with desired properties from the latent space, using the CLaSS sampling method proposed by Das, et al. [13]. CLaSS relies on a rejection sampling scheme to accept or reject molecules while sampling from a density model of the z embeddings. Acceptance/rejection criteria are determined by the output probabilities of the property predictors. See Methods for further details on CLaSS. Note that the CogMol generative framework relies on a chemical VAE, a protein sequence encoder, and a set of molecular property predictors, all of which are pre-trained on large amounts of broad data, i.e., chemical SMILES, protein sequences, and available protein-ligand binding affinities. The generative framework thus has important information already encoded about protein sequence homologies, chemical similarities, and protein-drug binding relations. This allows the framework to serve as a foundation, as it is instantly adaptable to different targets without any further model retraining or fine-tuning on target-specific data. The approach further saves the time and cost associated with generating target-specific binder libraries or resolving the target structure, which are typically considered privileged information, i.e., not broadly available. The model can also extrapolate to a target that does not share high similarity with the training data. This is indeed the case for the SARS-CoV-2 targets considered, where the lowest Expect value, a measure of sequence homology (lower values indicate higher homology), with respect to the BindingDB protein sequences is 0.51 (query coverage = 40%) and 1.9 (query coverage = 26%) for Mpro and spike RBD, respectively. This analysis implies that neither target is significantly similar to the protein sequences in the BindingDB database that was used for training the binding predictor, with spike RBD being more distinct than Mpro.

Candidate prioritization from the machine-designed ligand library
The next stage includes in silico screening of generated candidates (Figure 1d) to prioritize them for synthesis and wet lab evaluation. For practical considerations, we sought to keep the number of prioritized machine-designed de novo compounds to be synthesized and tested very small (around 10 for each target), as opposed to screening thousands of chemicals in a more traditional set-up. We used a combination of physicochemical properties (estimated using cheminformatics), target-molecule binding free energy predicted by docking simulations, and retrosynthesis and toxicity predictions made with machine learning. For retrosynthesis prediction, we used the IBM RXN platform [14], which is based on a transformer neural network trained on chemical reaction data. For toxicity prediction, an in-house neural network-based model trained on publicly available in vitro and clinical toxicity data was used. See Methods for details of candidate filtering and prioritization criteria. At the end of the in silico screening, the number of candidates per target was around 100, which was further narrowed down to around 10 per target at the discretion of Enamine Ltd., the chemical manufacturer. Feasibility of the predicted reaction schema, as evaluated by expert synthetic organic chemists, as well as commercial availability and cost of the predicted reactants, was used to finalize the candidate synthesis list. The final four candidates for each target were chosen based on the synthesis cost and delivery time, as provided by Enamine.

Synthesis of de novo compounds
Figure 2 lists the eight de novo compounds designed by the generative machine learning framework that were synthesized (see Supplementary Tables 2-3 for the predicted molecular properties). Details of the experimental synthesis protocols are provided in Methods and Supplementary Details C.2. We also provide a comparison between the predicted and the actual retrosynthetic pathways for those eight machine-designed compounds in Supplementary Table 4. Five were synthesized using the top predicted pathway of IBM RXN. For two compounds, GEN626 and GEN777, predictions were found to be unsuccessful, so alternative pathways designed by Enamine were employed (see Methods for details). For GXA104, reactants included in the RXN prediction were not available, so an alternative route was employed. Overall, with the top predicted route succeeding for five of the eight compounds, these results show the usefulness of machine learning-based retrosynthesis predictions for identifying plausible candidates and recommending viable synthesis routes.

Experimental validation of Mpro inhibition of de novo and commercially sourced compounds
Enzymatic inhibition by the de novo Mpro-specific molecules was measured by solid phase extraction purification linked to mass spectrometry (RapidFire MS) [21]. The results are presented in Figure 3a. Out of the four de novo compounds tested for this target, GXA70 and GXA112 both showed Mpro inhibition in the micromolar range, with IC50 values of 43 µM and 34.2 µM, respectively. This implies a 50% success rate of hit discovery for Mpro.
We further tested the generalizability of the pIC50 predictor (trained directly on the molecular SMILES and protein sequences) by validating predictions on selected commercially available lead-like compounds from the Enamine Advanced Collection [22]. For this purpose, we selected the top three Enamine compounds based on their predicted pIC50. One of these Enamine compounds showed inhibition (IC50 = 35.5 µM). Based on these results, we co-crystallised Mpro in the presence of this compound (ID Z68337194) and successfully obtained crystals. The resulting crystal structure revealed Z68337194 bound in the active site pocket. Structures of the other two commercially available compounds selected based on the pIC50 predictions were also found bound to the active site of Mpro, although these compounds showed no detectable inhibition of Mpro in the RapidFire mass spectrometry-based assay. Detailed analysis of the structure obtained for the complex of Mpro with Z68337194 (see Figure 3b-d) reveals that the sulphonamide group sits in the P4 subsite [23] and the amine forms an electrostatic interaction with the backbone carbonyl of Glu166. This interaction mimics that made by the P4 site amide of nirmatrelvir (PF-07321332) [18]. Z68337194 occupancy refines to approximately 50%. In the active site, shifts are observed in the positions of Pro168, Leu167, Glu166, and Met165 to accommodate ligand binding. The compound does not sit deeply in the active site and does not interact with the catalytic machinery, providing opportunities to elaborate upon the compound in order to take advantage of further subsites. In the captured crystal form, the active site sits at the interface between symmetry-related protein monomers, and as a result a symmetry-related molecule provides additional interactions, primarily a stacking interaction between the ligand phenylamine ring and Pro252. Additionally, a hydrophobic pocket in the symmetry mate formed primarily by Gln256 and Val297 accommodates the chlorinated ring.

Experimental validation of spike-based pseudovirus and live virus inhibition of de novo compounds
For the CogMol-designed compounds targeting the spike RBD, we measured their neutralization ability using a spike-containing pseudotyped lentivirus and a live viral isolate. These results are summarized in Figure 4. Out of the four candidates, GEN725 and GEN727 showed IC50 values less than 50 µM (18.7 µM and 2.8 µM, respectively), indicating discovery of novel hits with reasonable inhibition of the pseudovirus at a 50% success rate (Figure 4a). Importantly, GEN727 exhibited live virus neutralization ability as well (Figure 4b).
We further checked whether GEN727 is effective across different SARS-CoV-2 variants. We compared the neutralization of viral variants of concern (VOCs), namely Alpha, Beta, Delta and Omicron, with neutralization of Victoria (SARS-CoV-2/human/AUS/VIC01/2020), a Wuhan-related strain isolated early in the pandemic from Australia, in both pseudovirus and live virus. Figure 4c shows that GEN727 neutralizes spike-containing pseudovirus across all VOCs with an IC50 value between 0.7 and 2.8 µM. Live virus data also show inhibition with an IC50 of less than 50 µM for Victoria, Alpha, Beta and Delta (Figure 4d). The virus neutralization results do not by themselves demonstrate direct interactions of GEN727 with the spike. To probe this, we performed thermofluor measurements to determine whether GEN727 affected the stability of the spike. The presence of the compound appeared to increase the speed of the transition of the spike to a less stable form; after overnight incubation at pH 7.5, very little of the spike population remained in the more stable form with the higher Tm of 65 °C (see Supplementary Figure 9).

Novelty of the de novo designs and comparison with known SARS-CoV-2 inhibitors
In order to characterize the novelty of the de novo bioactive hits, we identified the nearest compound from the PubChem database in terms of Tanimoto similarity [24], estimated using Morgan fingerprints [25]. Figure 7 reveals that none of the de novo molecules shares ≥ 0.7 Tanimoto similarity with PubChem molecules. We further computed the Tanimoto similarity of the de novo compounds to known SARS-CoV-2 Mpro inhibitors in the literature. These results are shown in Table 1. In this category, specifically, we considered the following: an aminopyridine hit identified in the COVID-19 Moonshot initiative [15], X77 identified using ultralarge docking [16], the oral inhibitor S-217622 from reference [17], Nirmatrelvir in PAXLOVID [18], an α-ketoamide inhibitor (Compound 21 from Zhang, et al. [19]), and Molnupiravir [20]. Consistently, the CogMol-designed inhibitors show high dissimilarity (as indicated by a low Tanimoto similarity of around 0.1) to existing SARS-CoV-2 Mpro inhibitors.
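The similarity measure used in this comparison is simple to state: for two fingerprints represented as sets of "on" bits, the Tanimoto similarity is the size of the intersection divided by the size of the union. The sketch below is a minimal pure-Python illustration; the toy bit sets and the names `ref_a` and `ref_b` are hypothetical, and in practice the bit sets would come from Morgan fingerprints computed with a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)

def nearest_neighbour(query_fp, library):
    """Return (name, similarity) of the library entry most similar to the query."""
    return max(
        ((name, tanimoto(query_fp, fp)) for name, fp in library.items()),
        key=lambda item: item[1],
    )

# Toy example with made-up bit indices (not real fingerprints).
library = {"ref_a": {1, 2, 3, 4}, "ref_b": {7, 8}}
best_name, best_sim = nearest_neighbour({1, 2, 3}, library)
is_novel = best_sim < 0.7  # novelty threshold quoted in the text
```

Here the nearest entry shares three of four bits with the query, so it would not count as novel under the 0.7 threshold.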

Insights into the binding mode of the de novo inhibitors
As experimental determinations of the structure of either Mpro or the spike protein in complex with the validated de novo inhibitors were not fruitful, we used docking simulations to provide insight into the plausible binding modes. Docking simulations on the generated molecules were performed in the presence of their respective target structure: PDB ID 6LU7 for Mpro and PDB ID 7Z3Z for spike RBD (see Methods for details). As shown in Figure 5, both machine-designed Mpro inhibitors, GXA112 and GXA70, revealed mainly hydrophobic contacts to the residues from the P1 and P2 subsites, which are the hotspots of interactions [23]. The hydrogen bonding pattern revealed by the two molecules is, however, starkly different: GXA112 forms hydrogen bonds mainly with the P1' site (T25), whereas GXA70 interacts with the P2 residues (D187 and Y54).
The non-extensive and diverse interaction pattern of the de novo and commercially sourced Mpro inhibitors reported in this study is consistent with reported observations for non-covalent inhibitors [26].
For the validated de novo spike inhibitor, docking simulation (see Figure 6) reveals that GEN727 contacts several hydrophobic residues from the RBD, such as Tyr365, Tyr369, and Phe374. Those residues, which constitute the lipid binding pocket of the spike RBD, are conserved across seven coronaviruses that infect humans [27]. Also, the docking strikingly recapitulates the binding of the natural lipid (see Figure 6d), suggesting that the lipid binding function maintains the conserved site targeted by GEN727. Therefore, binding of GEN727 might stabilize the closed form of the spike, reducing receptor interactions. In line with that, the thermofluor results (Supplementary Figure 9) showed an (albeit weak) indication that incubation of spike with GEN727 somewhat destabilized the spike, suggestive of a direct interaction underlying its broad-spectrum neutralization ability.

Discussion
The discovery of therapeutic candidates for diseases, including COVID-19, has been greatly advanced by the combined power of numerous in silico approaches. Nevertheless, even the most effective methods face broad challenges that are at the same time inherent to general inverse molecular design tasks and specific to biological target-ligand binding chemistry. The first of these pertains to the vastness of the chemical space being explored and its impact on the throughput and practical utility of prevailing methods. For example, the use of docking or molecular simulation methods to screen on the order of 10^8 to 10^9 commercially available compounds would incur a prohibitively high computational cost, estimated to reach 10 CPU years [16] per target (as opposed to screening of less than a thousand machine-designed de novo candidates via docking in the present study). The second challenge is the availability of critical information: while methods such as pharmacophore modeling and molecular docking have been used successfully in virtual screening or design of molecules [16,19,28-30], such approaches generally rely upon initial design constructs obtained from available crystal structure(s) of a target protein bound to a candidate compound or fragment hits. Such information is not guaranteed to be available for all drug targets of interest and may take months to derive, and consequently these approaches are not broadly applicable to the case where such structures are unknown.
In general, reliance on privileged information (the target protein structure and/or known hits) confines the discovery space to the neighborhood of known chemical entities. This dependency therefore presents a practical challenge to expanding the accessible chemical exploration space and to devising more readily generalizable approaches to inhibitor design for multiple targets, the structure and binders of which may not be known. This work establishes the basis for an alternative discovery paradigm, wherein a generative model is used to discover novel inhibitor hits for different protein targets efficiently. To our knowledge, this is the first validated demonstration of a single generative model enabling successful and efficient discovery of inhibitor molecules for two different target proteins, based only upon the protein sequence and without prior knowledge of target-specific ligand binding data or target structure. Previous generative machine learning models that have been subject to experimental validation of de novo-designed molecules were primarily either trained or fine-tuned on a target-specific ligand library [6,7,31-35].
The sequence information of new drug targets typically emerges at a much faster pace (days vs. months) than their detailed structural information, thanks to the latest advances in sequencing. As shown in Figure 1 [...] targets shows the potential of a sequence-guided generative machine learning-based framework to help with better pandemic preparedness.
The overall success rate of hit discovery found is 50% for both targets, which required synthesizing and screening only four compounds per target, as opposed to the <10% typically obtained using high-throughput screening [1]. Additionally, the validated hits reported in this study appear to be novel, based on molecular similarity analyses with existing chemicals and SARS-CoV-2 inhibitors, indicating impressive creative ability of the generative framework, which is not possible when screening known compounds. The efficiency of hit discovery realized here and the demonstrated generalizability to different targets advocate for pre-training on a large volume of general data, e.g., molecular SMILES, protein sequences, and protein-ligand binding affinities. Conceptually, this is a key feature of so-called foundation AI models [36], which are trained on broad data at scale and can be easily adapted to newer tasks. This perspective is also consistent with recent work establishing the informative nature of a deep language model trained on a large number of protein sequences, in terms of capturing fundamental properties [13,37]. The broad-spectrum efficacy across SARS-CoV-2 VOCs of the most potent spike hit observed is a further example: the VOC sequences were never made available to the generative framework during training or inference. Moreover, to our knowledge, this is the first report of a novel spike-based non-covalent inhibitor that exhibits broad-spectrum antiviral activity. This contrasts with therapeutic monoclonal antibodies, the only drugs currently in use that target the spike protein, where rather few are effective across VOCs [38].
Taken together, the results presented here establish the efficiency, generality, scalability, and readiness of a generative machine intelligence framework, particularly when combined with autonomous synthesis planning and robotic synthesis and testing [8], for rapid inhibitor discovery as applied to existing and emerging targets. Such a framework can therefore enhance preparedness for novel pandemics by enabling more efficient therapeutic design. The generality and efficiency of the mechanisms employed in CogMol for precisely controlling the attributes of generated molecules, by plugging in property predictors post hoc to a learned chemical representation, make it suitable for broader applications in advancing molecular and material discoveries. For example, the framework has already enabled novel photoacid generator molecule design in a data-efficient manner for performant and sustainable semiconductor manufacturing, which has been validated by subject matter experts [39].
There of course remains significant scope for improving the discovery power of the framework: incorporation of 3D structural information (when available) [40,41] and further constraining the generations (e.g., solubility, number of hydrogen bonding donor/acceptor sites, structural diversity) are potential directions for further work. Iterative optimization methods [42] can be adopted to improve initial hits by querying a set of molecular property evaluators along with a retrosynthesis predictor. Active learning paradigms can also be explored for improving process efficiency.

CogMol overview
SMILES VAE as a molecule generator: CogMol leverages a variational autoencoder [10,43] paradigm as the base generative model for molecules. The encoder in the VAE encodes molecules to a latent vector representation. The decoder maps latent vectors back to molecules. New molecules are generated by sampling from the latent space. Here, molecular SMILES is used as the input to the encoder and the output of the decoder. A bidirectional Gated Recurrent Unit (GRU) with a linear output layer was used as the encoder. The decoder contained a 3-layer GRU with a hidden dimension of 512 units and dropout layers with a dropout probability of 0.2. The parameters of the encoder-decoder pair are learned by optimizing a variational lower bound on the log-likelihood of the training data. The loss objective comprises a reconstruction loss and a Kullback-Leibler (KL) divergence term, where the KL divergence measures the difference between the fixed prior distribution p(z), standard normal in this case, and the learned distribution q_φ(z|x). The equation is missing from this extract; in its standard form, the bound reads:
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\mathrm{KL}}\left(q_\phi(z|x) \,\|\, p(z)\right)$$
Regularizing the latent encodings toward the prior implies that new samples can be generated from random points in the latent space, while points close in the latent space will be decoded into chemically similar molecules.
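The two terms of the objective described above can be made concrete with a small NumPy sketch. It assumes a diagonal Gaussian encoder and a standard normal prior, for which the KL term has a closed form, and a token-level cross-entropy as the reconstruction loss. The function names and toy shapes are hypothetical, and the GRU encoder and decoder themselves are not reproduced here.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, diag(exp(log_var))) as mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def reconstruction_nll(logits, targets):
    """Cross-entropy of the decoder output, summed over the SMILES tokens.

    logits: (T, V) unnormalised scores per position; targets: (T,) token ids.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

def negative_elbo(logits, targets, mu, log_var):
    """Loss to minimise: reconstruction term plus KL regulariser."""
    return reconstruction_nll(logits, targets) + kl_to_standard_normal(mu, log_var)

# Toy usage: a 3-token sequence over a 4-symbol vocabulary, 2-d latent.
rng = np.random.default_rng(0)
mu, log_var = np.array([1.0, 0.0]), np.zeros(2)
z = reparameterize(mu, log_var, rng)
loss = negative_elbo(np.zeros((3, 4)), np.array([0, 1, 2]), mu, log_var)
```

Minimising this quantity is equivalent to maximising the variational lower bound; any weighting between the two terms used in the actual training is not specified in the text and is therefore left out.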
The VAE was first trained for 40 epochs on 1.6M chemical molecules from the MOSES benchmarking dataset [44], which was chosen from the larger ZINC Clean Leads [45] collection. Then, along with the KL and reconstruction loss, the VAE was jointly trained for another 15 epochs to predict the molecular attributes QED and synthetic accessibility (SA) from the latent vectors z. Two separate linear regression models were trained, such that the VAE latent space becomes organized based on those physical properties and thus serves as an approximation of the joint probability distribution of molecular structure and chemical properties [46]. The training was further continued for a final 5 epochs on around 211k ligand molecules from the BindingDB database [47]. This paradigm therefore served as a molecule generator that is unbiased toward any particular target.
SMILES strings generated by the final VAE through sampling from q_φ(z|x) are 99% unique and exhibit greater than 90% chemical validity, while the root-mean-square errors (RMSE) on the QED and SA predictions are 0.0281 and 0.0973, respectively.

Molecular attribute predictors for conditional generation: Two predictors trained on the latent z vectors were used to design target-specific inhibitor molecules that are also drug-like. The QED regressor comprised 4 hidden layers with 50 units each and ReLU nonlinearity. Further, a target-chemical binder (strong/weak) predictor was trained on the latent z vectors of chemicals and the pretrained protein sequence embeddings [12], using the data released as part of DeepAffinity [48]. A pIC50 value of > 6 was used as the threshold to decide whether a compound was a strong binder. The protein embeddings and the molecular embeddings were concatenated and passed through a single hidden layer with 2048 units and ReLU nonlinearity. The z-based QED and pIC50 predictors yield an RMSE of 0.0281 and 1.282, respectively. This set of predictors was used for controlled sampling from the VAE model to design molecules with desired attributes.
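To relate the pIC50 threshold to the IC50 values reported in the results, recall that pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L, so pIC50 > 6 corresponds to an IC50 below 1 µM. The helper names in the sketch below are hypothetical and serve only to illustrate the conversion and the strong/weak labelling.

```python
import math

def pic50_from_ic50_molar(ic50_molar):
    """pIC50 = -log10(IC50), with IC50 given in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)

def is_strong_binder(pic50, threshold=6.0):
    """Binary label using the pIC50 > 6 threshold quoted in the text."""
    return pic50 > threshold

# An IC50 of 34.2 µM (34.2e-6 mol/L) corresponds to a pIC50 of roughly 4.47.
example_pic50 = pic50_from_ic50_molar(34.2e-6)
example_label = is_strong_binder(example_pic50)
```

Under this convention, the micromolar hits reported here would fall below the strong-binder threshold used for training labels, which is expected for early hits rather than optimised leads.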
CLaSS sampling used for conditional generation in CogMol: We briefly describe Conditional Latent (attribute) Space Sampling (CLaSS) 13 here. CLaSS uses (i) a density model of the VAE latent representation, and (ii) a set of molecular attribute predictors trained on the VAE latent vectors, to generate molecules in an attribute-controlled manner. For this purpose, a rejection sampling approach utilizing Bayes' theorem is used. To elaborate further, first an explicit density model is learned on the latent embeddings of the training data to ensure sampling is uniformly random. A Gaussian mixture model with 100 components and diagonal covariance matrices was used for this purpose. Assuming the attributes are all independent of each other and can be conditioned on the latent embeddings (i.e., the latent space encompasses all combinations of attributes), Bayes' rule was then used to define the conditional probability of a sample, given certain properties in terms of the predictor models above. Finally, we employ this definition in a rejection sampling scheme, such that samples drawn from the density model are accepted according to the product of the attribute predictor scores. For more details on the algorithm, see Supplementary Details C.1. Generating the 875k samples for each target took around two days using an NVIDIA Tesla K80 GPU.
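The rejection step described above can be illustrated with a minimal sketch. It is not the CogMol implementation: the two-dimensional latent space, the two-component mixture (the text uses 100 components fitted to VAE embeddings) and the two sigmoid "predictors" are hypothetical stand-ins chosen only so the example runs with the Python standard library. A draw from the density model is kept with probability equal to the product of the predictor scores.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_from_mixture(rng, components):
    """Draw one latent vector from a diagonal-covariance Gaussian mixture.

    `components` is a list of (weight, means, stds) tuples.
    """
    weights = [weight for weight, _, _ in components]
    _, means, stds = rng.choices(components, weights=weights, k=1)[0]
    return [rng.gauss(m, s) for m, s in zip(means, stds)]


def class_sample(rng, components, predictors, n_draws):
    """Rejection sampling: keep z with probability prod_i p(a_i | z)."""
    accepted = []
    for _ in range(n_draws):
        z = sample_from_mixture(rng, components)
        accept_prob = math.prod(predict(z) for predict in predictors)
        if rng.random() < accept_prob:
            accepted.append(z)
    return accepted


# Hypothetical 2-D latent space with two mixture components.
components = [
    (0.5, [-1.0, -1.0], [0.5, 0.5]),
    (0.5, [1.0, 1.0], [0.5, 0.5]),
]
# Hypothetical predictors: probability of "strong binder" and of "drug-like".
predictors = [lambda z: sigmoid(4.0 * z[0]), lambda z: sigmoid(4.0 * z[1])]

accepted = class_sample(random.Random(0), components, predictors, n_draws=2000)
print(f"accepted {len(accepted)} of 2000 draws")
```

In the actual workflow the accepted latent vectors would then be decoded into SMILES strings by the VAE decoder, which is outside the scope of this sketch.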

Ranking and prioritization
The filtering criteria included molecular weight (MW) less than 500 Da, QED greater than 0.5, SA less than 5, and octanol-water partition coefficient (logP) less than 3.5. MW, SA, logP, and QED were calculated using the RDKit toolkit 49 . A SMILES-based binding affinity (pIC 50 ) predictor trained on DeepAffinity 48 data was also used to rank the designed molecules by predicted affinity (AFF). SMILES sequences were first embedded using long short-term memory units (LSTMs). Those SMILES embeddings were then concatenated with pre-trained protein embeddings 12 ; the resulting model had an RMSE of 0.8426 on the test data. A threshold for predicted pIC 50 affinity with the respective target sequence was set: greater than 8 for molecules targeting M pro and greater than 7 for molecules targeting the spike RBD. This affinity predictor was also used to estimate target selectivity (SEL) 9 , defined as the excess affinity to the target compared to a random set of proteins, lack of which is a known cause for drug candidate failure.
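As an illustration of the property filters listed above, the following sketch applies the stated cut-offs to candidates whose descriptors are assumed to have been computed already (in the study, with RDKit and the affinity predictor). The `Candidate` class, the target keys and all numerical values are hypothetical.

```python
from dataclasses import dataclass

# Predicted pIC50 cut-offs per target, as stated in the text.
AFFINITY_CUTOFF = {"Mpro": 8.0, "spike_RBD": 7.0}


@dataclass
class Candidate:
    name: str
    mw: float     # molecular weight, Da
    qed: float    # quantitative estimate of drug-likeness
    sa: float     # synthetic accessibility score
    logp: float   # octanol-water partition coefficient
    pic50: float  # predicted affinity (AFF) for the target


def passes_filters(c: Candidate, target: str) -> bool:
    """Return True if the candidate satisfies every stated threshold."""
    return (
        c.mw < 500.0
        and c.qed > 0.5
        and c.sa < 5.0
        and c.logp < 3.5
        and c.pic50 > AFFINITY_CUTOFF[target]
    )


# Made-up descriptor values for three made-up molecules.
candidates = [
    Candidate("mol_a", 412.5, 0.71, 2.9, 2.8, 8.3),
    Candidate("mol_b", 530.1, 0.66, 3.1, 3.0, 8.6),  # fails the MW filter
    Candidate("mol_c", 388.4, 0.58, 2.4, 3.1, 7.4),  # passes for spike only
]
kept_mpro = [c.name for c in candidates if passes_filters(c, "Mpro")]
kept_spike = [c.name for c in candidates if passes_filters(c, "spike_RBD")]
print(kept_mpro, kept_spike)
```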
The molecules were also evaluated for predicted toxicity 50 across a total of 12 in vitro 51 and one clinical 52 endpoints. Morgan fingerprints were used as the input features for the toxicity prediction model. A multitask deep neural network containing a total of four hidden layers was used 50 : two layers were shared across all toxicity endpoints and two were specific to each of the endpoints. A ReLU activation was used for all layers except for the last, for which a sigmoid activation was used. Molecules that were predicted to have no toxicity for any of the toxicity endpoints were progressed in the workflow.
We then ran docking simulations on a prioritized set of fewer than 1000 designed molecules with their respective target structures, as the docking energy can provide an indication of actual inhibition. For M pro , we used a monomer from the first structure determined and deposited with the Protein Data Bank for SARS-CoV-2 M pro complexed with the covalent inhibitor N3 (PDB ID: 6LU7 23 ) and set the search space to fully encompass the receptor. For spike, we used a lipid-bound conformation (PDB ID: 7Z3Z) and kept the protomer frozen during docking, as the goal is to find molecules that dock to the lipid-bound spike RBD. Our intent was to exploit the lipid binding pocket for developing inhibitors that can trap the spike protein in the closed conformation, as this is known to have reduced interaction with the host ACE2 receptor 27,53 . Docking was performed using AutoDock Vina 54 run blindly over the entire protein structure with an exhaustiveness of 8, and repeated 5 times to find the optimal conformation. Compounds with a binding free energy given by docking of less than −8.4 kcal/mol with M pro were selected. For the generated spike compounds, we prioritized those that exhibited a binding free energy less than −7.5 kcal/mol. Further, we only considered compounds that were docked less than 3.9 Å from the lipid binding pocket in the final docked configurations. The surface and ribbon representations of ligands docked (or bound) to the target structure were produced with PyMol 55 and the protein-ligand interaction plots were produced with LigPlot+ 56 .
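The docking-based selection rule can be written down compactly. The sketch below assumes that the docking energies of the five repeated runs and the ligand-to-pocket distance are already available as plain numbers (the docking itself is not shown), and it interprets "optimal conformation" as the lowest energy over the repeats, which is an assumption. Function and key names are hypothetical.

```python
# Energy cut-offs (kcal/mol) and pocket distance cut-off (Å) from the text.
ENERGY_CUTOFF = {"Mpro": -8.4, "spike_RBD": -7.5}
POCKET_DISTANCE_CUTOFF = 3.9


def select_docked(target, run_energies, pocket_distance=None):
    """Apply the docking thresholds to one compound.

    run_energies: docking energies from the repeated runs (kcal/mol).
    pocket_distance: distance to the lipid binding pocket (Å), spike only.
    """
    best = min(run_energies)  # assumed: best pose = lowest energy
    if best >= ENERGY_CUTOFF[target]:
        return False
    if target == "spike_RBD":
        return pocket_distance is not None and pocket_distance < POCKET_DISTANCE_CUTOFF
    return True


# Made-up energies for illustration.
print(select_docked("Mpro", [-8.1, -8.6, -8.0, -7.9, -8.2]))             # True
print(select_docked("spike_RBD", [-7.8, -7.6, -7.2, -7.7, -7.4], 5.0))   # False
```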
In contrast with large-scale screening techniques, docking is only used to provide additional validation of the binding affinity predictor model and therefore can be run after filtering candidates based on the easily computed properties described above. After this filtering, we were left with fewer than 1000 molecules combined between the two targets on which to run docking. Each simulation takes only a few minutes and can be run independently in parallel, which means the entire in silico screening can be performed in less than a day when run on a compute cluster consisting of Intel Xeon E5-2600 v2 processors.

Retrosynthesis prediction
We assessed synthesis plausibility for the novel compounds, as a major challenge in driving successes in molecular discovery is to devise plausible and efficient synthesis-planning protocols. Here we applied the recent advances made by machine learning-based approaches to predict retrosynthetic routes from large reaction databases. To estimate the ease of synthesizability and facilitate synthesis planning of the selected compounds, we predicted the retrosynthesis pathways for each candidate using the IBM RXN platform 14 . RXN combines a transformer neural network for forward reaction prediction and graph exploration techniques to evaluate retrosynthesis paths, scoring them according to probability. The path is terminated when all reagents are found to be commercially available. Candidates for which RXN was unable to determine a feasible retrosynthesis route or which terminated with non-commercially available compounds were removed from consideration. For each prediction we used the following parameters: maximum single step reactions (depth), 6; minimum acceptance probability for a single step, 0.6; maximum number of pathways (beams), 10; number of steps between removal of low probability steps (pruning), 2; and maximum execution time, 1 hour. Commercial availability was determined by searching the eMolecules database 57 with a restriction on lead time of 4 weeks or less but no restriction on price.
In the next section, we provide a detailed comparison between predicted retrosynthesis and actual synthesis routes, which is also summarized in Supplementary Table 4. We considered three main aspects in the comparison: number of reaction steps leading to the final product, overlap of the products in the intermediate reaction steps, and overlap of reactants used in the reactions. We chose the best path from the top six predicted for comparison by optimizing first for product overlap and then for reactant overlap. Overall, the total number of actual reaction steps showed good agreement with predictions, generally only off by one or two steps. This was confirmed by the overlap of intermediate products, which showed that retrosynthesis often predicted the correct high-level path. Product overlap is highly variable, though, since there are relatively few per route (often only two or three). The actual synthesis routes even used many of the same reactants as predicted, although occasionally alternatives had to be found due to stock limitations. In general, the retrosynthesis prediction was used as a starting point and any "major" deviations required were considered a failure.
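The two-level choice of the best predicted path (product overlap first, reactant overlap as tie-breaker) amounts to a lexicographic maximum. The sketch below assumes that overlap is measured as the number of shared compounds, which the text does not specify, and uses placeholder compound labels rather than real structures.

```python
def overlap(predicted, actual):
    """Number of compounds shared between two collections (assumed metric)."""
    return len(set(predicted) & set(actual))


def best_path(predicted_paths, actual_route, top_n=6):
    """Pick the path with the highest product overlap, then reactant overlap."""
    return max(
        predicted_paths[:top_n],
        key=lambda path: (
            overlap(path["products"], actual_route["products"]),
            overlap(path["reactants"], actual_route["reactants"]),
        ),
    )


# Placeholder labels standing in for intermediate products and reactants.
predicted_paths = [
    {"id": 1, "products": ["P1"], "reactants": ["R1", "R2"]},
    {"id": 2, "products": ["P1", "P2"], "reactants": ["R1"]},
    {"id": 3, "products": ["P1", "P2"], "reactants": ["R1", "R3"]},
]
actual_route = {"products": ["P1", "P2", "P3"], "reactants": ["R1", "R3", "R4"]}

chosen = best_path(predicted_paths, actual_route)
print(chosen["id"])
```

Paths 2 and 3 tie on product overlap here, so the reactant overlap decides in favour of path 3.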

Synthesis protocols
In this section, we compare the retrosynthesis predictions to the actual routes used to synthesize the molecules. GEN727 was synthesized according to the best RXN-predicted method (Figure 8a). The synthesis of GEN725 was carried out by analogy to the best RXN strategy (Figure 8b). SNAr ester synthesis in DMF gave intermediate compound 13 with high yield. Cross-coupling of 13 with sulfonamide-pinacolborane led to the final product with a moderate yield (see Supplementary Details C.2 steps K-L for full details of the synthesis procedure). Several unsuccessful attempts were made to carry out the first step according to the retrosynthetic strategy for GEN626, which gave the desired intermediate only in very low yield. As a result, the synthetic pathway was changed (Figure 8c). The SNAr reaction was carried out with cyanide 8, which was followed by hydrolysis of intermediate compound 10 (obtained with a moderate yield). Reduction of the nitro group of 11 led to GEN626 (see Supplementary Details C.2 steps H-J). Following the pathway suggested by retrosynthesis for GEN777 did not give good results and the synthetic strategy needed to be changed (Figure 8d). We synthesized acyl chloride 5, which was reacted with methylamine in the next step. Thereafter, amide 6 was treated with PCl 5 and the resulting intermediate was reacted in situ with azide anion (see Supplementary Details C.2 steps D-G).
Enamine did not have boc-amino pinacolborane 20 in stock and could not follow the proposed retrosynthetic strategy for GXA104 (Figure 8e). Unprotected amino-pinacolborane was available and so the strategy was changed, which made it possible to obtain GXA104 in fewer steps. At first, 20 was reacted with carboxylic acid 19, which led to amide 21. Cross-coupling of 21 with 3-iodo-1H-indazole led to GXA104 (see Supplementary Details C.2 steps P-Q). GXA56 was synthesized according to the top RXN-predicted method (Figure 8f). GXA70 was synthesized by analogy to the best RXN-predicted method (Figure 8g). Minor modifications were made to the synthetic steps, such as use of other bases and organic solvents (not significant for the overall scheme). The RXN strategy was chosen due to the high reactivity of trichlorotriazine with amines and the need to substitute only one chlorine at the first stage (this is easier to control with the less nucleophilic aniline than with more nucleophilic aliphatic secondary amines).
The RXN-predicted strategy for GXA112 was followed as closely as possible. The last synthetic step (reaction with SO 2 (NH 2 ) 2 ) led to the final product with very low yield. To improve it, mono-Boc-protected SO 2 (NH 2 ) 2 was synthesized and reacted with 26. Boc-protected final product 30 was obtained and readily deprotected via TFA cocktail (see Supplementary Details C.2 steps V-X).

Cloning, protein production, and crystallization
M pro production: The M pro coding sequence was codon optimised for expression in E. coli and synthesised by Integrated DNA Technologies (IDT). The M pro expression construct used for crystallization comprises an N-terminal GST region, an M pro autocleavage site, the M pro coding sequence, a hybrid cleavage site recognizable by 3C HRV protease and a C-terminal 6-Histidine tag 58  pGEX-6P-1 (Sigma). Protein expression, purification and crystallisation were carried out in similar conditions to those previously described in Douangamath, et al. 59 . Specifically, crystals were obtained from 0.1 M MES pH 6.5, 15 PEG4K, 5% DMSO using drop ratios of 0.15 µl protein, 0.3 µl reservoir solution and 0.05 µl seed stock.
Genetic constructs of spike ectodomain: The gene encoding amino acids 1-1208 of the SARS-CoV-2 spike glycoprotein ectodomain, with mutations of RRAR > GSAS at residues 682-685 (the furin cleavage site) and KV > PP at residues 986-987, as well as inclusion of a T4 fibritin trimerisation domain, a HRV 3C cleavage site, a His-8 tag and a Twin-Strep-tag at the C-terminus, as reported by Wrapp, et al. 60 . All vectors were sequenced to confirm clones were correct.
Spike protein production: Recombinant spike ectodomain was expressed by transient transfection in HEK293S GnTI- cells (ATCC CRL-3022) for 9 days at 30 °C. Conditioned media was dialysed against 2x phosphate buffered saline pH 7.4 buffer. The spike ectodomain was purified by immobilized metal affinity chromatography using Talon resin (Takara Bio) charged with cobalt followed by size exclusion chromatography using HiLoad 16/60 Superdex 200 column in 150 mM NaCl, 10 mM HEPES pH 8.0, 0.02% NaN 3 at 4 °C.

X-ray screening of M pro binding compounds
Compounds were dissolved in DMSO and directly added to the crystallization drops giving a final compound concentration of 10 mM and DMSO concentration of 10%. The crystals were left to soak in the presence of the compounds for 1-2 hours before being harvested and flash cooled in liquid nitrogen without the addition of further cryoprotectant. X-ray diffraction data were collected on beamline I04-1 at Diamond Light Source and automatically processed using the Diamond automated processing pipelines 61 . Analysis was performed as outlined previously 59 . Briefly, XChemExplorer 62 was used to analyse each processed dataset that was automatically selected and electron density maps were generated with Dimple 63 . Ligand-binding events were identified using PanDDA 64 , and ligands were modelled into PanDDA-calculated event maps using Coot 65 . Restraints were calculated with AceDRG 66 or GRADE 67 , structures were refined with Refmac 68 and Buster 69 and models and quality annotations cross-reviewed.

Figure 8 (continued). Comparison between actual and predicted synthesis routes for (e) GXA104, (f) GXA56 and (g) GXA70.

The peptide/protein sample was loaded onto a solid-phase extraction (SPE) C4-cartridge, and washed with 0.1% (v/v) aqueous formic acid to remove non-volatile buffer salts (5.5 s, 1.5 mL/min) prior to elution with aqueous 85% (v/v) acetonitrile containing 0.1% (v/v) formic acid (5.5 s, 1.25 mL/min). The cartridge was re-equilibrated with 0.1% (v/v) aqueous formic acid (0.5 s, 1.25 mL/min) and the sample aspirator washed with an aqueous, organic and aqueous wash before injection of the next protein:peptide mixture sample onto the SPE cartridge. Data were extracted with RapidFire Integrator software (Agilent) and m/z (+1) was used for both the N-terminal fragment TSAVLQ (681.34 Da) and the 11-mer substrate peptide (1191.68 Da). The percentage M pro activity (N-terminal product peak integral / (N-terminal product peak integral + substrate peak integral) × 100) was calculated in Microsoft Excel and normalised data were transferred to Prism 9 for non-linear regression curve analysis. IC 50 values are reported as the mean of technical duplicates (n = 2; mean ± SD). Signal to noise (S/N) and Z'-factor were calculated in Microsoft Excel (Z' > 0.8) 21 .
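The percentage activity expression given above, together with the commonly used definition of the Z'-factor, Z' = 1 − 3(σ_pos + σ_neg) / |μ_pos − μ_neg|, can be sketched as follows. The text only states that Z' was calculated in Excel following reference 21, so the use of this standard formula with sample standard deviations is an assumption, and all peak integrals and control readings are invented.

```python
from statistics import mean, stdev


def percent_activity(product_integral, substrate_integral):
    """Product peak integral as a percentage of product + substrate."""
    return 100.0 * product_integral / (product_integral + substrate_integral)


def z_prime(positive_controls, negative_controls):
    """Z'-factor from replicate control readings (standard definition)."""
    spread = stdev(positive_controls) + stdev(negative_controls)
    window = abs(mean(positive_controls) - mean(negative_controls))
    return 1.0 - 3.0 * spread / window


# Invented peak integrals and control readings for illustration only.
activity = percent_activity(product_integral=30.0, substrate_integral=70.0)
quality = z_prime([100.0, 102.0, 98.0], [10.0, 11.0, 9.0])
print(activity, round(quality, 3))
```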

Spike thermal shift-based binding assay
Thermofluor (differential scanning fluorimetry, DSF) experiments were performed in triplicate in 96-well white PCR plates using a 1300-fold excess of small molecule (in DMSO) to 1.5 µg spike monomer in 50 µL buffer per well. An Agilent MX3005p RT-PCR instrument (λ ex 492 nm/λ em 585 nm) was used to monitor the fluorescence change of a 3x final concentration of SYPRO Orange dye (Thermo) in an "increasing-sawtooth" temperature profile where the temperature was increased in 1 °C increments from 25 °C to 98 °C with the fluorescence recorded at 25 °C. Four of the synthesised compounds were investigated using the Thermofluor assay to assess their effect upon stability. Several conditions were tested: in 20 mM sodium acetate pH 4.6, 150 mM NaCl, a storage buffer in which long term stability was observed to be much improved 70 ; in 50 mM HEPES pH 7.5, 200 mM NaCl immediately after buffer exchange from the storage buffer; after incubation overnight at pH 7.5; and after incubation overnight at pH 7.5 in the presence of the compound. Raw fluorescence data were analysed using Microsoft Excel and the JTSA software 71 using a 5-parameter model to produce melting temperature (T m ) values. Note that fresh spike protein exhibits a single melting transition which can be characterised as a melting point, T m , of 65 °C in neutral pH buffer. At a reduced pH of 4.6 the single melting transition is at 62 °C. As spike is incubated at pH 7.5 a second transition appears at a lower temperature with a T m of 50 °C. This transition increases as a proportion of the total melt until it is the only transition observed and correlates with a presumed conformational change of the spike trimer to a less stable form.

Focus reduction neutralization assay (FRNT) for measuring SARS-CoV-2 live virus neutralization of spike RBD-targeting compounds
Vero-CCL-81 cells (100,000 cells per well) were seeded in 96-well, cell culture-treated, flat-bottom microplates for 48 hrs. Compounds were serially diluted and incubated with approximately 100 foci of SARS-CoV-2 for 1 hr at 37 °C. The mixtures were added to the cells and incubated for a further 2 hrs at 37 °C, followed by the addition of 1.5% semi-solid carboxymethyl cellulose (CMC) overlay medium to each well to limit virus diffusion. Twenty hours after infection, cells were fixed and permeabilized with 4% paraformaldehyde and 2% Triton-X 100, respectively. The virus foci were stained with human anti-NP mAb (mAb206) and peroxidase-conjugated goat anti-human IgG (A0170; Sigma), and visualized by adding TrueBlue Peroxidase Substrate. Virus-infected cell foci were counted on the classic AID EliSpot reader using AID ELISpot software. The percentage of focus reduction was calculated by comparing the number of foci in treated wells with the number in untreated control wells and IC 50 was determined using the probit program from the SPSS package.
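A minimal sketch of the focus reduction calculation is given below. The expression 100 × (1 − treated / control) is the usual reading of "comparing treated with untreated wells" but is an assumption here. The IC 50 estimate uses a simple log-linear interpolation between the two bracketing concentrations as a rough stand-in; it is not the probit analysis in SPSS used in the study. All foci counts and concentrations are invented.

```python
import math


def percent_focus_reduction(treated_foci, control_foci):
    """Reduction in foci relative to the untreated control, in percent."""
    return 100.0 * (1.0 - treated_foci / control_foci)


def interpolate_ic50(concentrations, reductions):
    """Log-linear interpolation of the concentration giving 50% reduction.

    concentrations must be ascending; returns None if 50% is never crossed.
    """
    points = list(zip(concentrations, reductions))
    for (c_lo, r_lo), (c_hi, r_hi) in zip(points, points[1:]):
        if r_lo < 50.0 <= r_hi:
            frac = (50.0 - r_lo) / (r_hi - r_lo)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10.0 ** log_ic50
    return None


# Invented dilution series (arbitrary concentration units) and foci counts.
concentrations = [1.0, 10.0, 100.0]
treated = [90, 50, 10]
control = 100
reductions = [percent_focus_reduction(t, control) for t in treated]
ic50 = interpolate_ic50(concentrations, reductions)
print(reductions, ic50)
```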

Pseudoviral neutralization assay for measuring inhibition of SARS-CoV-2 pseudovirus entry of spike RBD-targeting compounds
Pseudotyped lentiviral particles expressing SARS-CoV-2 S protein were incubated with serial dilutions of compounds in white opaque 96-well plates for 1 hr at 37 °C. The stable HEK293T/17 cells expressing human ACE2 were then added to the mixture at 15000 cells per well. Plates were spun at 500 RCF for 1 min and further incubated for 48 hrs. Finally, culture supernatants were removed followed by the addition of Bright-Glo™ Luciferase assay system (Promega, USA). The reaction was incubated at room temperature for 5 mins and the firefly luciferase activity was measured using CLARIOstar ® (BMG Labtech). The percentage of neutralization of compounds towards pseudotyped lentiviruses was calculated relative to the untreated control and IC 50 was determined using the probit program from the SPSS package.

Algorithm 1
Require: Trained latent variable model (e.g. VAE), samples z_j drawn from domain of interest, labeled samples for each attribute a_i.
1: Encode training data x_j in latent space: z_{j,k} ∼ q_φ(z|x_j) for k = 1, ..., K
2: Use z_{j,k} to fit explicit density model Q_ξ(z) to approximate marginal posterior q_φ(z)
3: Train classifier models q_ξ(a_i|z) using labeled samples for each attribute a_i to approximate probability p(a_i|x)
4: Assuming attributes a_i are conditionally independent given z, then

A.3 Crystallography
Step A: A mixture of compound 1 (0.5 g, 2.2 mmol), propargyl bromide (0.4 g, 3.3 mmol) and potassium carbonate (0.6 g, 4.4 mmol) was suspended in acetonitrile (20 mL) and the reaction mixture was heated to 60 °C for 18 h. The solids were removed via filtration and the solvent was removed in vacuo. The residue was diluted with an aqueous NaHSO 4 solution (50 mL) and washed with dichloromethane (2 × 20 mL); the aqueous layer was basified with NaOH to pH=14, and extracted with dichloromethane (3 × 30 mL). The organic extracts were combined, dried over Na 2 SO 4 and concentrated in vacuo to obtain crude 2 (0.4 g) which was used in the next step without purification.
Step B: Crude compound 2 (0.4 g) was dissolved in methanol (10 mL) and a hydrogen chloride solution in dioxane (20 mL) was added. The reaction mixture was stirred for 18 h at 20 °C. The volatiles were removed in vacuo to obtain crude 3 (0.32 g) as a hydrochloride salt.

Figure 12. GEN777 synthesis route
Step D: Thionyl chloride (3 g, 25.2 mmol) was added to a solution of compound 4 (1.7 g, 6.6 mmol) in dichloromethane (10 mL) and the mixture was refluxed for 1 h and evaporated under reduced pressure to give compound 5.
Step E: To a saturated solution of aqueous methylamine (5 g), cooled to 0 °C, was added compound 5 (1.8 g, 7.9 mmol). After the completion of the reaction was confirmed, the resulting mixture was extracted with MTBE. The combined organic layers were washed with brine, dried over anhydrous Na 2 SO 4 and evaporated under reduced pressure to obtain 1 g of compound 6, which was used in the next step without further purification.
Step F: To a solution of compound 6 (1 g, 4.5 mmol) in dichloromethane (700 mL) was added PCl 5 (1.4 g, 6.72 mmol). The reaction mixture was stirred for 2 h at r.t. to obtain a solution containing compound 7, which was not isolated but directly used in the next step.
Step G: To the solution of compound 7 in dichloromethane (from Step F) was added TMSN3 (2.5 g, 21.7 mmol). The reaction mixture was stirred overnight at r.t. and evaporated under reduced pressure. The residue was purified by HPLC to give 0.130 g of GEN777.

Figure 13. GEN626 synthesis route
Step H: To a solution of compound 9 (0.55 g, 3 mmol) in dry DMF (15 mL), sodium hydride (as 60% suspension in mineral oil, 0.132 g, 3.3 mmol) was added in one portion. The mixture was stirred at 40 °C for 30 min and compound 8 (0.5 g, 3 mmol) was added. The reaction mixture was stirred at 20 °C for 18 h, diluted with water (100 mL), and extracted with ethyl acetate (3 × 30 mL). The combined organic layers were washed with water (4 × 50 mL), dried over Na 2 SO 4 and concentrated in vacuo to obtain the crude material which was purified via column chromatography (CHCl 3 :MeOH 10:1 as eluent) to afford 10 (0.18 g, 0.55 mmol, 18% yield) as yellow oil.
Step I: Compound 10 (0.18 g, 0.55 mmol) was suspended in conc. H 2 SO 4 (5 mL) and the reaction mixture was heated to 60 °C for 2 h, cooled with ice and diluted with an aqueous Na 2 CO 3 solution to basic pH. The resulting mixture was extracted with ethyl acetate (3 × 30 mL); the organic layer was dried over Na 2 SO 4 and concentrated in vacuo to obtain 11 (0.16 g, 0.46 mmol, 84% yield) as yellow solid.
Step J: To a solution of compound 11 (0.16 g, 0.46 mmol) in methanol (10 mL), Pd/C (10%w, 0.100 g) was added. The reaction mixture was evacuated and backfilled with hydrogen and then stirred for 18 h. The catalyst was removed via filtration and the solvent was removed in vacuo to obtain the crude material which was purified via preparative HPLC to obtain GEN626 (0.0614 g, 42% yield) as white solid.

Figure 14. GEN725 synthesis route
Step K: To a suspension of NaH (0.250 g, 6.31 mmol, 60% dispersion in mineral oil) in DMF (5 mL) was added dropwise a solution of 4-bromophenol (1.09 g, 6.31 mmol) in DMF (5 mL). The mixture was stirred for 1 h and compound 12 (1 g, 5.74 mmol) was added. The reaction mixture was stirred at 100 °C overnight, cooled to r.t. and poured into ice (100 mL). The precipitate was filtered and washed with water (3 × 10 mL) and with hexanes. The solid was dried in vacuo to give 13 (1.72 g, 92%).
Step P: To a solution of compound 19 (0.975 g, 5.
Step Q: A solution of compound 21 (0.410 g, 1.06 mmol), 3-iodo-1H-indazole (0.259 g, 1.06 mmol), Pd(PPh 3 ) 4 (0.061 g, 0.05 mmol) and Na 2 CO 3 (0.225 g, 2.13 mmol) in a mixture of dioxane/water (4:1) (5 mL) was stirred overnight at 90 °C under an argon atmosphere. The cooled mixture was diluted with water and extracted with dichloromethane. The combined organic layers were washed with water, dried over anhydrous Na 2 SO 4 and evaporated under reduced pressure. The residue was purified by column chromatography and by HPLC to afford 0.180 g of compound GXA104 (45% yield).
Step R: To a stirred solution of compound 22 (2 g, 11 mmol) in dichloromethane (40 mL) at 0 °C were added DIPEA (2.3 mL, 13.2 mmol) and 2,3-dihydro-1H-indole (1.22 mL) and the resulting mixture was stirred at r.t. for 16 h. After that the reaction mixture was diluted with water; the organic phase was washed with water and brine, dried over Na 2 SO 4 and evaporated to obtain crude product 23 (1.1 g), which was used in the next step without further purification.
Step S: To a stirred solution of compound 23 (1.1 g, 4 mmol) in dichloromethane (40 mL) at 0 °C were added DIPEA (0.86 mL, 4.94 mmol) and tert-butyl N-[4-(methylamino)cyclohexyl]carbamate (0.94 g) and the resulting mixture was stirred at r.t. for 16 h. After that the reaction mixture was diluted with water; the organic phase was washed with water and brine, dried over Na 2 SO 4 and evaporated under reduced pressure to obtain crude product 24 (1.5 g), which was used in the next step without further purification.
Step T: To a stirred solution of compound 24 (1.5 g, 3 mmol) in dichloromethane (30 mL) at r.t. were added DIPEA (0.68 mL, 3.90 mmol) and morpholine (0.28 mL, 3.25 mmol) and the resulting mixture was stirred at r.t. for 16 h. After that an additional amount of DIPEA (0.68 mL, 3.90 mmol) and morpholine (0.28 mL, 3.25 mmol) was added and the resulting mixture was stirred at r.t. for another 16 h. Then the reaction mixture was diluted with water; the organic phase was washed with water and brine, dried over Na 2 SO 4 and evaporated under reduced pressure to obtain crude product 25 (1.7 g), which was used in the next step without further purification.
Step U: To a stirred solution of compound 25 (1.7 g, 3 mmol) in dichloromethane (25 mL) was added 4 M HCl solution in dioxane and the resulting mixture was stirred at r.t. for 8 h. After that the reaction mixture was evaporated under reduced pressure to obtain crude product 26 (1.2 g), which was used in the next step without further purification.
Step V: To a stirred solution of compound 27 (0.7 mL, 7.4 mmol) in diethyl ether (10 mL) was added compound 28 (0.15 mL, 0.243 g, 1.7 mmol) at −78 °C and the resulting mixture was stirred at r.t. for 1 h. The reaction mixture was evaporated without heating to obtain crude product 29, which was immediately used in the next step.
Step W: To a stirred suspension of compound 26 (0.8 g, 1.7 mmol) in dichloromethane (10 mL) at 0 °C was added Et 3 N (0.76 mL, 5.45 mmol) followed by a solution of compound 29 in dichloromethane (3 mL) and the resulting mixture was stirred at r.t. for 16 h. After that the reaction mixture was diluted with water; the organic phase was washed with water and brine, dried over Na 2 SO 4 and evaporated under reduced pressure to obtain crude product 30 (0.8 g), which was used in the next step without further purification.
Step X: To a stirred solution of compound 30 (0.8 g, 1.4 mmol) in dichloromethane (5 mL) was added 4 M HCl solution in dioxane (1 mL) and the resulting mixture was stirred at r.t. for 8 h. Then the reaction mixture was evaporated under reduced pressure, the obtained residue was diluted with water, basified with a NaHCO 3 solution and extracted with dichloromethane. The combined organic phase was washed with water, dried over Na 2 SO 4 and evaporated under reduced pressure to obtain crude product. The crude product was purified by HPLC to obtain 0.01 g of GXA112.

Figure 1. Overview of our inhibitor discovery workflow driven by CogMol, a sequence-guided deep generative framework. (a-b) illustrate molecular VAE training on large-scale chemical SMILES (x) data and mapping of the existing protein-ligand affinity relations on the VAE latent space (z) by training a binding predictor, respectively. For the latter, we leverage pre-trained neural network (NN) embeddings of a large volume of protein sequences. (c) shows Controllable Latent Space Sampling or CLaSS, which samples from the model of VAE latent vectors by using the guidance from a set of molecular property predictors (e.g., protein binding), such that for a given target protein sequence, sampled z vectors corresponding to strong target binding affinity are accepted while vectors corresponding to weak target binding affinity are rejected. The accepted z vectors are then decoded into molecular SMILES. (d) Candidates are then ranked and filtered according to chemical properties, docking score to target structure, and predicted retrosynthetic feasibility and toxicity. (e) A small set of prioritized molecules are synthesised, followed by wet lab testing in specific in vitro assays to confirm target inhibition. (f) In the present case, for each target, of the four molecules tested, two showed promising levels of inhibition. We also report approximate sample sizes and timeline for each stage of our discovery workflow. Note the timeline does not include the training and testing of the generative and predictive machine learning models.

Figure 2. De novo-designed and commercially sourced molecules.Molecules with the prefix "Z" are molecules from the Enamine Advanced Collection catalog targeting M pro (top).Molecules with the prefix "GXA" are generated candidates targeting M pro (middle) while those with the prefix "GEN" are generated candidates targeting the spike RBD (bottom).

Figure 3. Inhibition of SARS-CoV-2 M pro by machine-designed de novo and commercially sourced compounds. (a) Half maximal inhibitory concentration (IC 50 ) from RapidFire MS experiments for de novo and commercial M pro inhibitor candidates. A "-" indicates no inhibition was detected. Candidates marked with † had successful crystal structures determined. (b-d) Crystal structure of the SARS-CoV-2 M pro in complex with Z68337194. (b) Ribbon representation with transparent surface of the M pro dimer coloured in wheat and light pink to delineate each protomer. The active site of each protomer is shown with Z68337194 in stick representation. (c) Surface representation showing the overall binding mode of Z68337194 at the active site of M pro . (d) Schematic representation of the interactions of Z68337194 with M pro . Residues indicated with * are from a symmetry related M pro protomer.

Figure 4. SARS-CoV-2 spike neutralization assays. Neutralization assay against SARS-CoV-2 pseudotyped lentivirus (a) and Victoria live virus (b) for four CogMol-generated compounds with DMSO as a control. (c) The most effective compound, GEN727, was selected for a pseudoviral neutralization assay against Victoria, Alpha, Beta, Gamma, Delta and Omicron variants of concern (VOCs), as well as (d) the live-virus neutralization assay. Error bars show the standard error of each measurement over two trials.

Figure 5. Docked structures of SARS-CoV-2 M pro with GXA112 and GXA70. Surface representation depicting the overall ligand binding modes of (a) GXA112 and (c) GXA70 at the active site of M pro . Schematic representation of the ligand interactions with M pro for (b) GXA112 and (d) GXA70.

Figure 6. Docked structure of SARS-CoV-2 spike protein RBD in complex with GEN727. (a) illustrates the ribbon representation with transparent surface of the spike trimer. Wheat, gray, and light pink colors are used to delineate each protomer. GEN727 (shown in stick representation) docked to a spike monomer structure is superimposed for reference. (b) Surface representation depicting the overall docking pose of GEN727 at the lipid binding site of the spike RBD. (c) A schematic of GEN727 interacting with the RBD. (d) Docked GEN727 in reference to stearic acid lipid bound to the spike RBD.
, it took us less than a week to design and prioritize the set of candidate molecules to be synthesized and tested in wet lab for the two SARS-CoV-2 targets. The information on SARS-CoV-2 sequences was made publicly available starting around January of 2020 and CogMol-designed candidates were open-sourced in the IBM COVID-19 Molecule Explorer platform in March 2020 while the first round of wet lab validation was completed in June 2020. This rapid pace of novel drug-like inhibitor discovery across two distinct drug

Figure 8. Comparison between actual and predicted synthesis routes. For each subfigure, the top reaction (enclosed in a box) is the actual synthesis procedure used in this study while the bottom reaction is the predicted retrosynthetic pathway.

Figure 9. Thermofluor assay results. Thermofluor raw fluorescence data for experiments with AI-designed compound GEN727 (black) and a DMSO control (grey). Data were recorded using protein that was used immediately after dilution into neutral buffer (solid lines), incubated overnight in neutral buffer (long-dashed lines), or incubated overnight with the compound in neutral buffer (short-dashed lines). For comparison, data from protein in pH 4.6 buffer is also shown (dotted lines).

Figure 10. Docked structure of SARS-CoV-2 spike protein RBD in complex with GEN725. (a) Surface representation depicting the overall ligand binding modes of GEN725 at the lipid binding site of the RBD. (b) Schematic representation of the ligand interactions with the spike RBD.

Table 1. Molecular similarity with existing inhibitors. Tanimoto similarity of the validated machine-designed de novo candidates to existing SARS-CoV-2 M pro inhibitors.

Table 5. Crystallographic data collection and refinement statistics. Values in parentheses refer to the highest resolution shell.

Table 6. SARS-CoV-2 target protein sequences. The amino acid sequences of the protein targets used in the generation pipeline.