PIDiff: Physics Informed Diffusion Model for Protein Pocket-Specific 3D Molecular Generation ⋆

Recentremarkableadvancementsingeometricdeeplearninghaveledtounprece-dentedprogressinthree-dimensional(3D)generationofligandsthatcanbindtotheproteinpocket


Introduction
Structure-based drug design (SBDD), an approach wherein drugs are developed leveraging protein structures, holds great promise, considering that drugs are substances capable of binding to protein structures to either inhibit or activate specific functions (Anderson, 2003).Accordingly, numerous extensive studies have been conducted to identify ligands that can bind to target proteins (Blundell, 1996;Lyne, 2002;Shoichet, 2004;Pagadala, Syed and Tuszynski, 2017;Hansson, Oostenbrink and van Gunsteren, 2002).However, traditional methods encounter substantial hurdles in drug development, primarily owing to the vast chemical space and computationally-demanding tasks, resulting in high costs and prolonged timelines.To address these challenges, the field of generative models has extensively explored molecule generation, traditionally representing them in onedimensional (strings) or two-dimensional (topology) formats (Cheng, Gong, Liu, Song and Zou, 2021).Nevertheless, these approaches have limitations in accurately depicting molecules in three-dimensional (3D) space.They fail to capture the intricate atomic interactions within the proteinligand binding pocket (Xie, Wang, Li, Lai and Pei, 2022).Consequently, this shortfall hinders the accurate prediction of molecules that effectively bind to specific target proteins.Grisoni and Schneider, 2021;Isert, Atz and Schneider, 2023) for molecular structures have spurred a shift in molecular generation research toward the utilization of geometric 3D representations, surpassing the earlier string-and topologybased models.Generating the desired molecules based on a 3D representation typically entails modeling both continuous variables (such as the position of atoms) and discrete variables (like the types of atoms).This transition to 3D methodologies facilitates a direct examination of how molecules interact within protein pockets, offering a more realistic and logical strategy for specific scenarios.
Particularly in 3D object modeling, the adoption of SE(3)equivariant neural networks, which incorporate an inductive bias to adeptly navigate the vast search space created by object translation and rotation, has unlocked unprecedented potential in 3D molecular generation Satorras, Hoogeboom and Welling (2021); Xu, Yu, Song, Shi, Ermon and Tang (2022).Additionally, the application of denoising diffusion probabilistic model (DDPM) (Ho, Jain and Abbeel, 2020;Nichol and Dhariwal, 2021), which has shown impressive results in image generation, is emerging as a successful approach in the challenging task of 3D molecular generation (Hoogeboom, Satorras, Vignac and Welling, 2022;Guan, Qian, Peng, Su, Peng and Ma, 2023).Building on previous research, these advancements have laid the foundation for a more rational approach to utilizing geometric information of molecules in drug design.Nevertheless, when the objective is to generate molecules with the capability to bind to a protein, the binding principles between the protein and ligand need to be reconsidered once again.The binding of a small molecule to a protein pocket implies a greater stability compared to its unbound state.
This can be elucidated by the physicochemical principle wherein the binding free energy attains its minimum when the protein and small molecule are combined (Zhou and Gilson, 2009;Luque and Barril, 2012).Therefore, recalling these fundamental principles, designing molecular generation models to learn only the geometric shapes of molecules, as done in previous studies, could be a superficial approach that overlooks the essential principles underlying the problem we seek to solve.Such an approach risks overlooking deep and more critical aspects of molecular interactions and binding processes.
Another important aspect to consider in 3D molecular generation is the dataset utilized.Specifically, in studies focusing on protein-aware 3D molecular generation, the pdb format (Berman, Westbrook, Feng, Gilliland, Bhat, Weissig, Shindyalov and Bourne, 2000) is commonly used.This data type represents proteins and ligands, which vibrate according to thermodynamic principles Fischer, Coleman, Fraser and Shoichet (2014), as fixed relative atomic positions in three-dimensional coordinates.This can be explained by the fact that, whether obtained through crystallization methods such as X-ray crystallography MacArthur, Laskowski and Thornton (1994), predicted by AlphaFold Jumper, Evans, Pritzel, Green, Figurnov, Ronneberger, Tunyasuvunakool, Bates, Žídek, Potapenko et al. (2021), or derived from protein-ligand complexes obtained via Crossdocking Francoeur, Masuda, Sunseri, Jia, Iovanisci, Snyder and Koes (2020), these structures are all subject to constant vibration in real environments, yet are represented as fixed three-dimensional coordinates.Consequently, such data have limitations in capturing the dynamic and constantly changing real-world environment.Crucially, existing generative models have been trained relying on the empirical distribution of these data, posing considerable challenges in synthesizing high-quality data that accurately reflects the true nature of molecular interactions.
To address the two primary limitations in 3D molecular generation, we propose incorporating intrinsic physicochemical principles into the generative model, drawing inspiration from the physics-informed neural network (PINN) (Karniadakis, Kevrekidis, Lu, Perdikaris, Wang and Yang, 2021;Hao, Liu, Zhang, Ying, Feng, Su and Zhu, 2022;Moon, Zhung, Yang, Lim and Kim, 2022).PINN is a comprehensive framework that enhances a model's generalization performance by integrating relevant physical properties as inductive biases within the neural network, rooted in extensive fluid dynamics research.By leveraging this concept, we develop PIDiff, a generative model that defines the inductive bias based on the fundamental physicochemical principle that stable binding between a protein and ligand occurs at positions where the binding free energy is minimized along their reaction coordinates.PIDiff is designed not only to learn the geometric information of molecules bound to target proteins but also to be trained to ensure that the generated molecules achieve minimal binding free energy with the target proteins.Binding free energy can be approximated as the sum of intermolecular interaction energies, such as Van der Waals (Bronowska, 2011;Bitencourt-Ferreira, Veit-Acosta and de Azevedo, 2019) interactions.By differentiating this value with respect to the positions of the generated atoms, we can condition the minimization of the binding free energy.To the best of our knowledge, this approach is the first attempt in molecular generation research to integrate the physicochemical principle of binding free energy minimization as an inductive bias to include the intrinsic principles of protein-ligand binding.The design of this 3D molecular generation model allows for the synthesis of higher-quality data that is closer to real-world scenarios.This is achieved by not only learning the empirical distribution of the existing training datasets but also by ensuring compliance with the underlying physical laws implicitly embedded within the datasets.Additionally, the physicochemical principles we aim to incorporate into our generative model are invariant truths that apply equally even in the dynamic and constantly vibrating thermodynamic environment.By doing so, we can more accurately reflect real-world environments compared to previous studies that focus solely on geometric information.
Commencing with the performance evaluation of the proposed model on the CrossDocked2020 (Francoeur et al., 2020) benchmark dataset, we conducted extensive experiments with a primary focus on the potential utility of molecular generative models in the drug development process.As a result, when compared to several contemporary models on the CrossDocked2020 benchmark, our model achieved stateof-the-art performance and demonstrated the stability of the molecules generated by our model in terms of molecular conformation.Furthermore, beyond just benchmark performance, we validated the capability of our model to generate molecules with high binding affinity in pocket structures obtained from various sources for key target proteins responsible for several diseases.We conducted an additional assessment of the proposed model by verifying its selectivity against off-target effects (Xie, Xie and Bourne, 2011;Whitebread, Hamon, Bojanic and Urban, 2005), which are a major concern in SBDD research.Lastly, we conclude our proposed comprehensive evaluation framework by integrating molecular dynamics simulations (MD) (De Vivo, Masetti, Bottegoni and Cavalli, 2016) to more realistically assess the actual behavior of molecules generated by the model within the protein pocket.This multifaceted evaluation provides a thorough understanding of our model's performance, realworld applicability, and selectivity in drug development processes.
Our main contributions can be summarized as follows: • We introduce PIDiff, the first model in 3D molecular generation research to incorporate the physicochemical principle that the binding of proteins and ligands occurs at the state where the binding freeenergy landscape is minimized.This approach allows for more feasible molecular generation by satisfying the physical laws that can be applied to the real world.A critical aspect of this process is that domain knowledge is simultaneously infused during the reverse process to entail that the binding free energy at the reaction coordinate between the inferred Real Molecule and a fixed protein pocket is minimized.
• Our experimental results demonstrate that our model, PIDiff, has set a state-of-the-art performance record by surpassing other baseline models in terms of the binding affinity of the molecules it generated for the protein pockets in the CrossDocked2020 benchmark dataset.
• We propose a diverse and rigorous experimental framework to evaluate the performance of generative models in SBDD from the perspective of their utility in drug development.The results offer various perspectives on the applicability of our model to drug development.

Problem definition
To represent molecules that bind to specific protein pockets, we construct protein-ligand pair , where  pair is the number of protein-ligand pairs.Here ,   and   represent the 3D coordinates of atoms within the protein pocket and ligand respectively, and   and   denote features such as atom types associated with the protein pocket atoms and ligand atoms.For brevity, we represent proteinn pocket as , and the molecule binding to the pocket as  = {(  , ,  , )}  pair =1 .Our goal is to generate  that binds to a given protein pocket .

Preliminaries
When aiming to generate a molecule that binds to a given protein pocket, the application of a DDPM (Ho et al., 2020) requires tractable distributions capable of modeling the molecule's atom coordinates and types.Previous research (Ho et al., 2020;Hoogeboom, Nielsen, Jaini, Forré and Welling, 2021;Guan et al., 2023) has developed a diffusion framework that utilizes Gaussian distribution  for modeling continuous variables (like atom coordinates) and categorical distribution  for discrete variables (such as atom types).The atom coordinates and types of the ligand are independently perturbed by introducing a small Gaussian noise and uniform noise at each timestep t, respectively.This process follows a Markov chain characterized by predetermined variance schedules   , as shown in Eq.( 1).The arbitrary step for both distributions can be expressed using   = 1 −   and ᾱ = Π  =1   in Eq.( 2), and the posterior of atom coordinates and atom types can be defined in closedform as in Eq.( 3) using the Bayes' rule.Detailed information is described in Supplementary Material (1).
The critical aspect in generating molecules in 3D space, particularly during learning of the reverse transition kernel , is to design a model that approximates a likelihood invariant to the translation and rotation of the protein-ligand complex.This approach is an essential inductive bias enables efficient exploration of the extensive search space associated with SE(3) transformations.Previous studies (Guan et al., 2023;Satorras et al., 2021;Xu et al., 2022)

Overview of proposed model
As illustrated in Fig. 1, we introduce PIDiff, a diffusionbased generative model infused with the chemical principles of stable states between proteins and ligands.Our model operates by fixing the position of protein atoms and injecting noise into the position and type of atoms in the Real Molecule, transforming it into a Noised Molecule through a forward process.It then undergoes a reverse process, removing the noise to revert back to the Real Molecule.A critical aspect of our model is that during the reverse process, it is not merely reconstructing the position and type of Real Molecule atoms but is designed to ensure that the ligand satisfies the physicochemical principles for stable binding to the target protein.

Equivariant reverse process
The reverse process, which is the opposite of the forward process that injects noise into the coordinates and types of atoms constituting a molecule as expressed in Eq.( 3), can be represented as shown in Eq.( 4).Our objective is to approximate these reverse processes through a neural network parameterized by .
Here, [⋅, ⋅] is the concatenation operator,   and   represent the 3D coordinates and atom types of the atoms constituting the ligand, respectively, while   and   similarly denote the 3D coordinates and atom types of the atoms constituting the protein.To ensure the generative process for molecule inference remains unaffected by the translation and rotation of complex structures, we have employed an approximation function defined by an SE(3)-equivariant graph transformer architecture (Guan et al., 2023;Hoogeboom et al., 2022;Guan, Qian, Ma, Ma and Peng, 2021).The model is designed to predict   0 and   0 , as in Eq.( 5).From these predicted values, it derives   and   of the reverse transition kernel, enabling the sampling process.Additionally, the methods for calculating X 0 and V  0 slightly differ from each other.A detailed description of the layers that constitute   can be found in the Supplementary Material (2).

Physics informed optimization
It crucial to conduct optimization both from a data perspective (to align the geometric distribution of molecules inferred by the generative model with the distribution of real data) and from a physics-informed perspective (to minimize the binding free energy between the inferred molecules and the target protein).Training from the data perspective follows previous research (Ho et al., 2020;Guan et al., 2023;Hoogeboom et al., 2021) by optimizing the variational bound of the negative log likelihood.For continuous variables, such as atom coordinates, training involves minimizing the KL-divergence between the forward Gaussian process , which can be precisely defined in a closed form (indicated in Eq.( 6)).For discrete variables, representing atom types, since the distribution is not Gaussian but categorical, training progresses by minimizing the KL divergence of the posterior for both forward and reverse processes at each time step, utilizing the predicted value V  0 , as outlined in Eq (7).Additional explanations for the two loss functions are summarized in Supplementary Material (4). where To perform optimization from physics-informed perspective, the binding free energy must be calculated based on the positions of atoms inferred by the generative model.This calculation begins with evaluating the energy associated with Van der Waals interactions, which is a critical step in determining the interaction dynamics between the protein and ligand (Pacholczyk and Kimmel, 2011).The minimization of binding free energy is of paramount importance to    ∈  (0, ) 3: , where  ∼ Gumbel(0, 1) ] 8: end for 9: return   0 ,   0 achieve a stable and favorable bond between the protein and ligand.Therefore, we prioritize the minimization of free energy as our primary target within the optimization process.To this end, we introduce constraints to ensure that derivative of the binding free energy with respect to the inferred atomic positions is zero.This methodology guides the generated ligand and target protein toward a state of minimized binding free energy, effectively promoting the formation of stable interactions.The total Lennard-Jones potential energy is represented as the sum of the energies for all possible pairs of protein atoms and ligand atoms, as expressed in Eq (8).
]‖ ‖ ‖ ‖ where  and  denote the index sets of ligand and protein atoms, respectively, while   represents the distance between a ligand atom and a protein atom.  and   refer to the 3D coordinates of atoms constituting the ligand and protein, defined as and ] ∈ ℝ  ×3 , respectively. and  ′  are constants since their values are not influenced by the positions of the generated atoms.
A crucial point to note here is that the energy of Van der Waals interaction we aim to model must satisfy the condition of being invariant to rotation and translation regarding the positions of atoms inferred by the generative model and the positions of protein atoms.As the likelihood of the reverse process in the diffusion model is mentioned to be invariant to translation and rotation in Section 2.2, we demonstrate the SE(3)-invariance of the Van der Waals interaction as in Proof: Proof.Let   () can be written explicitly as   () = +, where  ∈ ℝ 3×3 is the rotation matrix and  ∈ ℝ 3 is the translation vector.
□ We formalize the derivative of the distance between the predicted positions of the protein and ligand atoms with respect to the potential energy derived from the positions of atoms inferred by the generative model, as expressed in equation ( 9).
Therefore, our ultimate objective is to concurrently minimize the generative loss related to Eqs ( 6) and ( 7), as well as the derivative loss related to Eq (9), as expressed in Eq (10).
The generative process of the trained PIDiff is described in Alg.1.

Experiments
We remind ourselves of the purpose that the proposed model should be utilized to enhance efficiency and accelerate the drug development process, and pose the following research questions(RQs) to determine whether the proposed model can actually be applied in drug development, especially in SBDD.
• RQ1.Can molecules generated by the model effectively bind to target proteins?-Section 3.

Analysis of binding affinity of generated molecules to target proteins
To verify whether the trained generative model can infer molecules that can bind to a protein pocket, we employed the CrossDocked2020 (Francoeur et al., 2020) dataset.Furthermore, to clearly understand the model's generalization ability, we applied filtering and splitting to the dataset like in previous studies (Luo, Guan, Ma and Peng, 2021;Peng, Luo, Guan, Xie, Peng and Ma, 2022).The test and training sets were divided under the condition that the protein sequence identity between the two subsets was below 40%, and the sequence identity was computed using the MMseq2 (Steinegger and Söding, 2017).From this procedure, we obtained a high-quality training set containing 100,000 protein-ligand pairs and a test set composed of 100 proteins.We evaluated the superiority of the proposed PIDiff model by comparing the mean and median binding affinity of 100 molecules generated per protein pocket in the test set against various baseline models.The baseline models used for comparison with PIDiff were retrieved from recent studies, and descriptions of each model are provided in Supplementary Materials (3).The binding affinity was estimated by calculating the binding energy via the widely-used AutoDock Vina (Trott and Olson, 2010;Eberhardt, Santos-Martins, Tillack and Forli, 2021) docking tool.
The experimental results listed in Tab.1 verify the ability of PIDiff to generate molecules with high binding affinity to target proteins, as confirmed by various evaluation metrics.Using the AutoDock Vina tool, we obtained three versions of binding energy: Vina Dock, Vina Min, and Vina Score.The widely-used Vina Dock (Zhang, Zhang, Jin, Zhang, Hu, Shen, Cao, Du, Kang, Deng et al., 2023;Zhang and Liu, 2023;Schneuing, Du, Harris, Jamasb, Igashov, Du, Blundell, Lió, Gomes, Welling et al., 2022;Peng et al., 2022) metric yields the minimum binding energy (≃ maximum binding affinity) following docking, which fine-tunes the structure of the synthesized molecules and explores various binding poses.Furthermore, Vina Min assesses the binding affinity post-local energy minimization.Nonetheless, these methods(Vina Dock, Vina Min) may change the coordinates of the created atoms, possibly making them inappropriate for an unbiased evaluation of the generative model.Thus, we also utilized Vina Score, which evaluates the binding affinity while preserving the atomic coordinates as predicted by the model.High Affinity denotes the proportion of generated molecules that exhibit superior binding to each test protein compared with a reference molecule that is bound to a protein pocket on the test set.SR (Success Rate) is defined as the success rate, representing the proportion of the 100 protein pockets in the test set where at least one generated molecule exhibits stronger binding affinity than the ligand bound in the test set pocket.When we reiterate that these generative models have the responsibility to create structures of molecules that could potentially become new drugs, it is crucial not only to measure the binding affinity of all generated molecules but also to analyze the binding affinity of molecules within the synthesizable range.To this end, we adopted a threshold for the generally synthesizable SA (synthetic accessibility) values mentioned in (Popova, Isayev and Tropsha, 2018).We measured the binding affinity of molecules generated by each model that surpassed the SA threshold, defined as Vina Score SA .Finally, we adopted the PoseBusters Buttenschoen, Morris and Deane (2024) test suite to evaluate the physical and chemical validity of the generated molecules.This benchmark tool performs a total of 12 tests from three perspectives: chemical validity and consistency, intramolecular validity, and intermolecular validity.A generated molecule that passes all 12 tests is defined as a valid molecule.Additionally, the performance for various metrics that characterize the molecular properties is provided in the supplementary materials (8).
As listed in Table 1, the proposed PIDiff, substantially outperforms the comparison models across the six evaluation metrics.Compared with the previous state-of-theart Targetdiff model, PIDiff shows a notable improvement with a 21% increase in Vina Score, 12% in High Affinity, 13% in Vina Score SA , and 9% in SR.Notably, as seen in the SR metric, our model can generate molecules that exhibit superior binding to all protein pockets in the test set compared with the reference molecules.These results indicate that prior chemical knowledge (that is, the complex structures of a protein and ligand should be located at a minimum in the binding free-energy landscape) is crucial for rational molecule design.This is particularly evident when comparing our model to Targetdiff, which similarly uses a diffusion generative model to generate molecules.The ablation case of the PIDiff model, which does not consider physicochemical principles, corresponds to the performance of TargetDiff.Additionally, an interesting observation is that for some comparison models, the binding energy (Vina Score) between the generated molecules and the target protein can result in positive(+) values.This indicates unrealistic binding poses, and docking of these binding poses yields negative(-) values for the binding energy (Vina Dock).Consequently, despite the generative model producing molecules with unrealistic binding poses, the Vina Dock metric adjusts them to realistic binding poses for calculating the binding energy.This observation highlights that the evaluation based on the Vina Score, which calculates the binding energy with fixed molecular positions inferred by the generative model, offers a fair assessment of the model performance.
Regarding the Valid PB metric, PIDiff's performance ranks third, following Pocket2Mol and AR.According to Supplementary Material (9), the performance gap between our model and the Pocket2Mol (or AR) models is primarily due to many molecules failing the bond angle test, one of the 12 tests conducted.For models like Pocket2Mol and AR, the molecule generation process adopts an autoregressive sampling approach, where atoms are sequentially predicted to form a complete molecule.This process ensures that the geometric rationality among previously predicted atoms is maintained when predicting the next atom.Consequently, many molecules pass the bond angle test in the PoseBusters suite.However, our model employs a non-autoregressive sampling method, inferring an entire molecule at once rather than predicting atoms sequentially.As a result, it shows comparatively lower performance in the bond angle test than autoregressive sampling models.However, autoregressive sampling models are prone to cumulative errors and exposure bias due to discrepancies between the training and sampling processes Xiao, Wu, Guo, Li, Zhang, Qin and Liu (2023).Consequently, they perform significantly worse than PIDiff in all evaluation metrics except for the Valid PB metric.This indicates that these models do not generate molecules with strong binding affinities to the target protein, making them unsuitable as drug candidates.Therefore, autoregressive models are not ideal for generative tasks in structure-based drug design.

Stability of generated molecules in terms of binding energy
Docking tools widely used in SBDD typically return the optimal pose and structure of an input ligand through various optimization processes (Meng, Zhang, Mezei and Cui, 2011).Inspired by molecular docking simulation, we conducted experiments to verify the stability of molecules generated by our model from an energy perspective.This assessment involved comparisons of the molecular structures inferred by the generative model with those optimized through docking simulations.If the molecular conformations before and after docking have a large difference, it implies that the molecular structure and binding pose generated by the model was chemically unfavorable for binding with the target protein, leading to substantial adjustments to the input molecular structure during docking.In such cases, the generative model may be insufficient for producing  This means that significant adjustments to the generated molecule were made during docking to achieve an ideal protein-ligand binding.Consequently, it implies that the generated molecule has a structure and pose disadvantageous for binding to the protein.In contrast, Case.B shows that the difference in conformation between the generated molecule and the post-docking molecule is relatively small.This implies that not many adjustments were needed for the generated molecule during docking to achieve ideal proteinligand binding.Consequently, it suggests that the generated molecule has a structure and pose advantageous for binding to the protein.
molecular conformations that are reasonable from a freeenergy perspective.A brief example illustrating this aspect is described in Fig. 3. Considering the interpretability of this evaluation method, the experimental results shown in Fig. 2 allow us to assess the stability of molecules generated by the proposed model, including comparison models.
The experimental result showcass the distribution of conformational differences in molecules generated by each model before and after docking.The conformational differences were obtained by calculating the root mean square deviation (RMSD) of molecule positions before and after docking.The molecules generated by the proposed PIDiff model exhibit substantially less structural change after docking than those generated by five comparison models.This result clearly indicates that our model can infer the most favorable molecules for binding.Moreover, it highlights the importance of not only learning the distribution of molecular positions and structures during model training but also incorporating prior physicochemical knowledge into the generative model.
For a deeper analysis, we conducted several case studies.We visualized molecules generated for specific protein pockets both in their original form and after docking.For concise notation, we will henceforth refer to the undocked generated molecules (original form) as pre-docking molecules, and those subjected to docking as post-docking molecules.The protein-ligand complexes were categorized into three groups as molecules: 1) obtainable from the Cross-Docked2020 dataset (reference), 2) generated by the proposed PIDiff model, and 3) generated by the latest model, ResGen (Zhang et al., 2023).In Fig. 4, the molecules in cyan color spectrum indicate pre-docking molecules, while those in red color spectrum signify post-docking molecules.The molecules generated by the proposed PIDiff model exhibit negligible conformational differences between pre-and post-docking.In contrast, molecules generated by ResGen show significant structural changes after docking, implying the need for considerable fine-tuning for favorable binding with the protein.Furthermore, this case study provides a visual rationale for the distribution of low RMSD value observed for PIDiff in the previous experiment (Fig. 2), as well as the distribution of high RMSD value noted for other comparative models.Additionally, the visualization results from this case study underscore the need for a cautious approach when evaluating the performance of target-aware 3D molecule generation models based on the outcomes (Vina Dock) of docking simulations.Additional explanations regarding the molecules presented in the Fig. 4 are available in Supplementary Material (7).

Applicability of generated molecules generated to target proteins of various diseases
While achieving high performance on benchmark sets such as CrossDocked2020 is undoubtedly important, it is essential to remember that the target protein-aware 3D molecule generative models, including the model proposed in this study, are intended to accelerate the SBDD task.Accordingly, to evaluate the practical utility of 3D molecular generation, it is crucial to verify whether these models can design reasonable molecules for proteins that are actual drug targets, not just those proteins included in the benchmark dataset.To bridge the gap between artificial intelligence (AI) generative models and practical SBDD, we conducted experiments to explore the feasibility of generating candidate compounds for inhibitors of kinases, which play a major role in the oncogenesis and metastasis of various cancers (Bhullar, Lagarón, McGowan, Parmar, Jha, Hubbard and Rupasinghe, 2018).
Observing the Vina score patterns of molecules generated by PIDiff across all six cases, considering the reference drugs are compounds actually binding to the proteins, it can be inferred that the generated molecules possess the capacity to produce numerous molecular structures capable of functioning as kinase inhibitors.A quantitative analysis reveals that the proportions of molecules with superior binding affinity to the reference drugs for RET, ERBB2, ABL1, EGFR, ALK, and KIT are 36%, 13%, 38%, 84%, 35%, and 52%, respectively.Notably, the most of molecules generated for the EGFR protein demonstrate superior binding affinity to the binding drug, which is a remarkable result.Based on this case study, our proposed model, PIDiff, suggests that it not only achieves high performance on benchmark datasets but also possesses the capability to generate molecules with competitive binding affinities for target proteins aimed at treating various diseases in practice.
However, when broadening the perspective on experiments within real-world settings, the possibility of the absence of the target protein structure cannot be overlooked.Considering that, out of the 250 million proteins with known sequences available through the UniProt (Universal Protein Resource) (uni, 2023), only approximately 210,000 protein structure entries are available for 67,000 unique proteins via the PDB (Berman et al., 2000) database, the mention of the above possibility appears to be reasonable.To address this bottleneck, we conducted additional experiments incorporating the AlphaFold (Jumper et al., 2021) system, which offers unprecedented accuracy in protein structure prediction.
For this purpose, we selected FLT3 and VEGFR2 proteins, which are experimentally known to be active with the  target drug sunitinib (Roskoski Jr, 2007).These proteins are well-studied as target proteins for kinase inhibitors (Shibuya, 2010).We extracted the sequence information for the ATP binding sites of these two proteins and used AlphaFold (Varadi, Anyango, Deshpande, Nair, Natassia, Yordanova, Yuan, Stroe, Wood, Laydon et al., 2022;Varadi, Bertoni, Magana, Paramval, Pidruchna, Radhakrishnan, Tsenkov, Nair, Mirdita, Yeo et al., 2024) to obtain the structural information for these sequences.Subsequently, we generated 1000 molecules for each protein using the same manner as the experiment for Figure 5.The Vina scores of these molecules were then compared to the Vina score of the reference drug sunitinib.Additionally, to verify the reliability of the molecules generated using the AlphaFoldpredicted protein structures, we compared their Vina scores with those of molecules generated based on experimentally determined structures.In other words, we obtained PDB files (FLT3: 4RT7, VEGFR2: 1YWN) containing the actual structural information for the sequences used to obtain the protein structures via AlphaFold, and then generated 1000 molecules for these experimentally determined structures as well.The distribution of Vina scores for the 1000 molecules generated for the AlphaFold version of the protein is shown in gray, while the distribution for the 1000 molecules generated for the PDB version of the protein is shown in green.The Vina score of the reference drug, sunitinib, is indicated by a blue bar.Among the molecules generated for the Al-phaFold version of the protein, the proportions of molecules with stronger binding affinities than sunitinib were 58% for FLT3 and 74% for VEGFR2 (In this comparison, the Vina scores for both the generated molecules and sunitinib were calculated using the PDB version of the protein.).As shown in the figure 6, the distribution of Vina score for the molecules generated using the structure obtained from the PDB and those generated using the structure obtained from AlphaFold are not significantly different.This indicates that generating new molecules using structures predicted by AlphaFold is a valid approach even when the exact structure of a protein is unknown.In summary, we can conclude that our model, PIDiff, has the capability to generate competitive molecules for target drugs using structures obtained from AlphaFold.

Selectivity of generated molecules
Kinases that regulate various signaling pathways in human diseases, including cancer, vascular diseases, diabetes, inflammation, and degenerative diseases, have become attractive targets for new drug development.Most kinase inhibitors bind to the ATP(Adenosine TriPhosphate) binding site within the kinase catalytic domain.However, the high structural conservation across the ATP binding sites of various kinases can lead to off-target effects (Attwood, Fabbro, Sokolov, Knapp and Schiöth, 2021), with kinase inhibitors binding to kinases other than the target protein.This can cause undesirable side effects, ultimately undermining the clinical effectiveness of a drug (Grünwald, Heinzer and Fiedler, 2007;Wolter, Stefan, Decallonne, Dumez, Bex, Carmeliet and Schöffski, 2008) and remaining a challenging issue.
To address this challenge, we considered the hypothesis that molecules generated based on the structure of a specific protein have a relatively weak binding to proteins with different structures and aimed to validate this hypothesis using our model.For the experiment, we selected sorafenib, a kinase inhibitor targeting the B-Raf protein approved by the FDA(Food and Drug Administration) for treating liver and kidney cancer, and identified its off-target proteins (PKB, ERK1, PIM1, MEK1, CDK1, PKC, IGFR1, EGFR, c-MET,  HER2) through our curation, drawing on previous studies (Karaman, Herrgard, Treiber, Gallant, Atteridge, Campbell, Chan, Ciceri, Davis, Edeen et al., 2008).The PDB IDs for the structures of each protein are described in Supplementary Material (6).
After calculating the Vina score for the B-Raf protein, we selected molecules with binding affinities superior to sorafenib and that surpassed the SA value threshold used during the calculation of Vina Score SA in Section 3.1.These selected molecules were then calculated for their vina scores against 10 off-targets.The distribution of Vina scores for the selected molecules, both on-target (B-Raf) and off-targets, are shown in Fig. 7.The molecules generated for B-Raf tend to bind less strongly to off-target proteins with structures different from the on-target protein.This observation aligns with our hypothesis to some extent.Overall, molecules generated by PIDiff for a target protein pocket can achieve selective binding by not adhering to other proteins.This demonstrates the potential applicability of our model for SBDD, suggesting its utility in achieving targeted binding with high selectivity.

Stability of generated molecules from thermodynamics perspective
A molecular docking simulation has exceptional efficiency in measuring the binding affinity of various molecules with proteins and rapidly optimizing those molecules.For fast processing, it superficially handles various aspects such as the solvation energy without properly constructing the environment of real biological systems, consequently sacrificing accuracy (Śledź and Caflisch, 2018;Kitchen, Decornez, Furr and Bajorath, 2004;Carlson, 2002).To overcome these limitations, in-silico based drug design traditionally employs MD simulation, which models the thermodynamic behavior of individual particles over time, allowing for a detailed understanding of protein-ligand interactions within a realistic biological system environment (e.g., aqueous state) on a temporal basis.However, to date, attempts to evaluate the performance of generative models using MD simulation have not been made in molecular generation research.This gap represents one of the reasons why the divide between actual drug design and molecular generation studies has not been narrowed.
Therefore, we propose a novel experimental framework in our molecular generation research for SBDD that involves validating molecules generated by the generative model through MD simulations.In this framework, after the broad and general performance of the generative model has been assessed through a docking simulation, a more detailed evaluation is conducted by comparing the MD simulation results of molecules generated by the model with the behavior of actual drugs.This comparison enables a rigorous assessment of the performance of individual molecules.
We performed clustering based on structural similarity among the generated molecules and selected one molecule from each cluster for MD simulation.This approach aims to assess the results of the MD simulations by selecting representative molecules from each cluster, thereby allowing for an indirect evaluation of the entire set of generated molecules.We used ECFP (Rogers and Hahn, 2010) fingerprints to represent the molecular structures and employed kmeans clustering with k = 3.To visualize the results of the k-means clustering, we plotted the ECFP fingerprints using PCA (Principal Component Analysis) and displayed the clusters as shown supplementary materials (10).From each cluster, we selected one molecule using the same criteria applied in Section 3.4 and performed MD simulations for these representative molecules.The target protein was selected as the B-Raf protein, which was utilized in section 3.4, and MD simulations were performed on Sorafenib, known to bind to the B-Raf protein.The preprocessing of the protein and ligand for simulation was supported by the CHARMM-GUI (Jo, Kim, Iyer and Im, 2008) tool, and MD simulations were conducted over 10 ns using the ABMER (Salomon-Ferrer, Case and Walker, 2013;Arantes, Polêto, Pedebos and Ligabue-Braun, 2021) tool.The detailed procedure is outlined in our source code.From the simulation results, we measured the distance (i.e., RMSD) between the protein pocket and ligand across all trajectory files over time for both for sorafenib and for the three generated molecules.As shown in Fig. 8, compared with the RMSD over time for the reference sorafenib drug, the three molecular structures generated by PIDiff exhibits a stable fluctuation range over time, displaying an overall lower RMSD profile.An increasing RMSD indicates either an unstable ligand structure or weaker binding affinity with the protein (De Vivo et al., 2016).Thus, we can conclude that the molecular structure generated by our model can stably bind to the target protein.These experimental results are a critical indication that molecules generated by generative models, such as the proposed PIDiff, possess competitive binding capabilities to target proteins in real physiological environments, even when compared with approved drugs.

Conclusion
In this study, we propose the PIDiff model that can generate rational and realistic drugs capable of binding to the structure of a target protein pocket.This research stems from the necessity of considering not only geometric information of complex structures but also the intrinsic principles of binding between proteins and ligands.To assess our model, we first evaluated the binding affinity of molecules generated for target proteins on a benchmark dataset.Then, we investigated the structures of target proteins that caused real diseases to determine the practical applicability of the model.Additionally, we verified whether the molecules generated by our model could exhibit selectivity, a critical challenge in SBDD.Finally, we validated whether the molecules produced by our model demonstrated stable binding patterns within biological systems when compared with real drugs.Overall, experiments from various perspectives have confirmed the potential of PIDiff as a valuable tool for accelerating drug development.
While generating molecules that stably bind to target proteins is a crucial step, it is not the endpoint in drug development.To be viable drug candidates, the generated molecules must also meet various criteria, such as ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) Ferreira and Andricopulo (2019).Although addressing these issues is essential, our study primarily focused on generating molecules that bind to target proteins, rather than optimizing for efficacy or other drug-like properties.Consequently, we did not explore these aspects in depth.To address these challenges, we believe that latent space optimization Rombach, Blattmann, Lorenz, Esser and Ommer (2022); Wallace, Gokul, Ermon and Naik (2023) or manipulation Park, Kwon, Choi, Jo and Uh (2023) and guidance sampling techniques Song, Sohl-Dickstein, Kingma, Kumar, Ermon and Poole (2020); Dhariwal and Nichol (2021) will be key solutions in the future.In particular, applying guidance sampling methods, which are gaining attention in various condition generation tasks such as the inverse problem, to our drug development tasks could be highly promising.This approach could enable the generation of molecules with desired properties for structure-based drug design (SBDD) tasks, making it an exciting area of future research.

Supplementary Material (2)
Ultimately, the training of a diffusion-based generative model involves approximating the mean (µ θ ) of the reverse process at each timestep to the mean ( μ) of the forward process.At that time, learning the µ θ of the reverse process can be broadly classified into three methods through the use of a reparameterization trick suitable for each objective: deriving X 0 from X t at an arbitrary timestep and then learning based on X 0 , deriving noise from X t at an arbitrary timestep and then learning based on the noise, and lastly, deriving the score from X t and learning based on that score.Among these three methods, we adopt the approach of TargetDiff [4], which has recently been successfully experimented with, to train our diffusion model.Additionally, the training of the diffusion process for the categorical distribution follows the approach of Argmax Flow [3], which has been successful in generation of discrete data.
To generate molecules in 3D form, we need to model both the three-dimensional coordinates of each atom and the type of that atom, respectively.Therefore, the backbone architecture for performing the diffusion process can be represented as XM , t, X P , V P , as mentioned in the manuscript.Each function used for predicting X M 0 and V M 0 is defined as follows: where index i represents the atom index to be modeled and V denotes the set of atoms in proximity to the atom with index i.Further, h is obtained from the embedding matrix encoding the atomic type, and h i,l indicates the embedding value of the atom corresponding to index i in the l-th layer.d i j represents the Euclidean distance between two atoms i and j, and e i j is an indicator for the connection between the two atoms.X represents the 3D coordinates of the protein and ligand atoms, and it can be rewritten as [X P ,X M ].The mask 1 X M is designed to fix the coordinates of atoms related to protein while allowing updates only for the ligand atoms.The layers that make up f h and f X employ the graph transformer architecture [5], and the parameters of these two types of layers( f h , f X ) are partially shared.The last layer embedding h L is processed through a multi-layer perceptron and a softmax function, resulting in the derivation of V M 0 , and X M 0 is defined as the value corresponding to the index of the ligand atom in the last layer X L t .

Supplementary Material (3)
(a) AR: This 3D generative model estimates the probability density of atoms in 3D space, given a designated protein binding site as context.It employs an auto-regressive sampling scheme to sequentially generate 3D molecules, stopping when no space remains for new atoms.
(b) liGAN: This deep learning system generates 3D molecular structures conditioned on a receptor binding site, approach in structure-based drug discovery.It utilizes a conditional variational autoencoder trained on atomic density grids of cross-docked protein-ligand structures.Atom fitting and bond inference are applied to construct valid molecular conformations from generated atomic densities.The system evaluates the properties of generated molecules, showing significant changes when conditioned on mutated receptors.The latent space of the generative model is explored through sampling and interpolation.
(c) GraphBP: This framework generates 3D molecules binding to specific proteins using a 3D graph neural network.It obtains geometry-aware and chemically informative representations from contextual information, including the binding site and previously placed atoms.The generation process sequentially generates atom types and their relative locations using a local spherical coordinate system.
(d) Pocket2Mol: An E(3)-equivariant generative network for efficient molecular sampling based on 3D protein pockets.It comprises a graph neural network for spatial and bonding relationships in binding pockets and an algorithm for sampling drug candidates conditioned on pocket representations.The network considers 3D coordinates, bond types, and functional groups, sampling drug candidates from a tractable distribution without MCMC methods.
(e) DiffSBDD: Introduces protein-conditioned and ligand-inpainting generation strategies.Protein-conditioned generation treats the protein as a fixed context, while ligand-inpainting models the joint distribution of the protein-ligand complex, inpainting new ligands during inference.
(f) TargetDiff: This study proposes a 3D equivariant diffusion model for target-aware molecule generation and affinity prediction.It learns a joint generative process for atom coordinates and types using a SE(3)-equivariant network.The model overcomes challenges of voxelized atom densities and autoregressive sampling, being rotation-equivariant and respecting geometric constraints.It serves as an unsupervised feature extractor for binding affinity estimation and improves binding affinity ranking and prediction without retraining.
(g) ResGen: This method predicts new atom types and positions based on neighboring atoms' structural and geometric information.It uses scalar and vector features, along with a Gaussian mixture distribution, to determine the position distribution of newly generated atoms.
predicting each atom to form a complete molecule by iteratively predicting the position of the next atom based on the previously predicted atoms.At this point, when predicting the position of the next atom, the models ensure that the geometric rationality among the previously predicted atoms is maintained, resulting in superior pass rates in the bond angle criterion tests.However, the autoregressive sampling method introduces bias due to the discrepancy between the model's behavior during training and sampling.Additionally, this approach may generate unrealistic fragments because it cannot grasp the overall context of the molecular structure during the initial stages of sampling.Furthermore, in this approach, the output of each step is used as the input for the next step.Errors that occur at each stage can accumulate over iterations, adversely affecting the quality of the final output.Consequently, this can lead to the generation of unrealistic and functionally inadequate molecules.To overcome these issues, models like ours, including TargetDiff and DiffSBDD, generate molecules using a non-autoregressive approach.This non-autoregressive method creates an entire molecule at once rather than adding atoms step by step.By generating the molecule in a single step, we can address the limitations associated with the autoregressive sampling method.
To summarize, the lower pass rates in the bond angle criterion for models like ours, including TargetDiff and DiffSBDD, can be explained by the fact that these models do not adopt a step-by-step approach to add atoms based on geometric rationality.However, as confirmed in the revised Table 1, the non-autoregressive sampling method is more effective for the primary objective of creating molecules that strongly bind to the target protein.Furthermore, both Pocket2Mol and AR models, which use autoregressive sampling, generate molecules with weaker binding affinities compared to the molecules in the test set.Therefore, the molecules generated by these models cannot be considered strong candidates for drug discovery, as they do not exhibit strong binding to the target protein.This suggests that while autoregressive sampling ensures the geometric validity of bond angles, it does not necessarily achieve strong binding affinities to the target protein.This implies that if the geometric rationality of bond angles between atoms in molecules generated using non-autoregressive sampling can be further improved, the PB-valid metric could become competitive with those of autoregressive models.To this end, future research focusing on enhancing the validity of bond angles in generated molecules-such as incorporating the torsion angles of atoms constituting the ligand during model training-holds great promise.This approach could significantly improve the geometric rationality of generated molecules, making it an exciting and promising area for further investigation.

Figure 1 :
Figure 1: Overview of PIDiff.The forward process incrementally injects noise into a Real Molecule( 0 ) until it transforms into a Noised Molecule(  ) across  timesteps.In contrast, the reverse process progressively removes noise to convert a Noised Molecule back into a Real Molecule through a function parameterized by .A critical aspect of this process is that domain knowledge is simultaneously infused during the reverse process to entail that the binding free energy at the reaction coordinate between the inferred Real Molecule and a fixed protein pocket is minimized.

Figure 2 :
Figure 2: Distribution of the change in structure difference between pre-docking and post-docking for the generated molecule

Figure 3 :
Figure 3: Case.A indicates that there is a relatively large difference in conformation between the generated molecule and the post-docking molecule.This means that significant adjustments to the generated molecule were made during docking to achieve an ideal protein-ligand binding.Consequently, it implies that the generated molecule has a structure and pose disadvantageous for binding to the protein.In contrast, Case.B shows that the difference in conformation between the generated molecule and the post-docking molecule is relatively small.This implies that not many adjustments were needed for the generated molecule during docking to achieve ideal proteinligand binding.Consequently, it suggests that the generated molecule has a structure and pose advantageous for binding to the protein.

Figure 4 :
Figure 4: Visualization of binding pose of reference molecules and molecules generated by PIDiff, and ResGen on protein 3CHC, 5W2G, 3KC1(PDB ID).Among the generated molecules, those in the cyan color spectrum represent pre-docking molecules, while those in the red color spectrum signify post-docking molecules.

Figure 5 :
Figure 5: The distribution of Vina Score for molecules generated by PIDiff for six target proteins of kinase inhibitors, for which structures are obtainable from the PDB database.The red bars represent the Vina Score of drugs known to actually bind to each target protein.

Figure 6 :
Figure 6: The distribution of Vina Score for molecules generated by PIDiff for two target proteins of kinase inhibitors.The Vina scores for molecules generated using both the AlphaFold version and the PDB version of the protein structures were measured against the PDB version of the protein structure.

Figure 7 :
Figure 7: Box plot of Vina scores for molecules generated for the B-Raf protein, shown for on-target (red box) and off-targets (gray box).Superior Vina scores for on-target and less favorable scores for off-targets imply excellent selectivity.

Figure 8 :
Figure 8: Changes in the distance between the protein and ligand over time during MD simulations for the B-Raf protein with two molecules (Sorafenib, generated molecule) (a) example/3chc B rec.pdb example/3chc B rec ligand.sdf(b) example/5w2g A rec.pdb example/5w2g A rec ligand.sdf(c) example/3kc1 A rec.pdb example/3kc1 A rec ligand.sdf

Algorithm 1
Inference procedure Input: Protein pocket atoms  and trained PIDiff model   Output: Generated molecule  1: Sample initial molecular atom coordinates    and atom types

Table 1
Evaluation of generated molecules in terms of binding affinity by our model(PIDiff) and other baselines for proteins from CrossDocked testset.

Table 1 :
The binding affinity of molecules generated by each model and the RMSD difference of the changed molecular conformation before and after docking.