GTAM: a molecular pretraining model with geometric triangle awareness

Abstract
Motivation: Molecular representation learning is pivotal for advancing deep learning applications in quantum chemistry and drug discovery. Existing methods for molecular representation learning often fall short of fully capturing the intricate interactions within chemical bonds of 2D topological graphs and the multifaceted effects of 3D geometric conformations.
Results: To overcome these challenges, we present a novel contrastive learning strategy for molecular representation learning, named Geometric Triangle Awareness Model (GTAM). This method integrates new molecular encoders for both 2D graphs and 3D conformations, enabling the accurate capture of geometric dependencies among edges in graph-based molecular structures. Furthermore, GTAM introduces two contrastive training objectives designed to facilitate the direct transfer of edge information between 2D topological graphs and 3D geometric conformations, enhancing the functionality of the molecular encoders. Through extensive evaluations on a range of 2D and 3D downstream tasks, our model has demonstrated superior performance over existing approaches.
Availability and implementation: The test code and data of GTAM are available online at https://github.com/StellaHxy/GTAM.


Introduction
Deep learning has increasingly influenced molecular chemistry, yielding notable progress in quantum chemistry (Dral 2020) and drug discovery (Stokes et al. 2020, Liu et al. 2023b). Acquiring molecular representation embeddings is crucial for advancing deep learning in this field. These embeddings are key in the design of functional and innovative compounds using deep learning methodologies (Jumper et al. 2021).
Molecules can be depicted through two distinct modalities: two-dimensional (2D) graphs and three-dimensional (3D) conformations. The 2D graphs highlight the topological connections among atoms, represented by nodes (atoms) and edges (chemical bonds). The 3D conformations focus on spatial atom arrangements, defined by each atom's specific spatial coordinates within the molecule. Recent studies (Schütt et al. 2017, Luo et al. 2022, Stärk et al. 2022, Zhou et al. 2022, Liu et al. 2023a) have demonstrated that enhancing the mutual information between these two modalities improves molecular representation capabilities. Notable multimodal pretraining approaches for molecules, such as GraphMVP (Liu et al. 2021) and 3DInfomax (Stärk et al. 2022), utilize various objectives to facilitate the exchange of multimodal information. These methods rely fundamentally on the encoders for 2D graphs and 3D conformations.
Existing encoders used for molecular embedding exhibit certain limitations (Garg et al. 2020). First, conventional 2D graph neural networks such as GIN (Xu et al. 2018) and Graphormer (Ying et al. 2021) update node embeddings via edge information aggregation and vice versa. In chemistry, a chemical bond is often influenced by the other chemical bonds to which it is connected (Ruedenberg 1962), which highlights the intricate interdependencies within a molecule's structure. Nevertheless, current graph neural networks used to learn molecular representation embeddings do not adequately model the interrelations of chemical bonds in molecules. Second, for 3D conformations, the inter-edge relationships, which involve physical constraints, have not been effectively addressed in previous graph networks such as SchNet (Schütt et al. 2017) and DimeNet (Gasteiger et al. 2020b). GeomGCL (Li et al. 2022) attempted to address this issue by proposing edge-to-edge updates in graph data, yet this approach only considers the angle relation between edges. Moreover, most current multimodal pretraining approaches (Liu et al. 2021, Stärk et al. 2022, Zhu et al. 2022, Liu et al. 2023a) focus on maximizing mutual information (MI) across modalities, thereby ensuring that learned representations capture the information shared between them. These methods all focus on the exchange of node information in molecules. Notably, there is a correlation between edge attribute information in 2D topological graphs and edge length information in 3D geometric conformations of molecules (Ruedenberg 1962). The types of chemical bonds significantly influence molecular physical and chemical properties, such as boiling points and solubility. Understanding these bond types and the corresponding distances between atoms is crucial for comprehending and predicting molecular properties (Pauling 1931). Yet, previous methods have limitations in effectively learning and sharing edge-related information across different modalities during pretraining.
To address these limitations, we introduce a molecular contrastive learning method called Geometric Triangle Awareness Model (GTAM). GTAM aims to maximize the mutual information using contrastive self-supervised learning (SSL) and generative SSL (Liu et al. 2021, 2023a). First, for generative SSL we use diffusion generative models, which can lead to more accurate estimation. Second, to enhance molecular representations in contrastive SSL, we introduce new molecular encoders that incorporate a novel geometric triangle awareness mechanism. Unlike other molecular graph encoders (Schütt et al. 2017), this mechanism adds edge-to-edge updates to the usual node-to-edge and edge-to-node updates. Drawing inspiration from the triangle update methods utilized in AlphaFold2 (Jumper et al. 2021), our geometric triangle awareness update mechanism employs self-attention for the dynamic integration of other edges and structural information. Third, to reduce information loss in contrastive SSL, GTAM also incorporates two contrastive training objectives to enhance the fusion of edge information across the different modalities. With advanced molecular encoders and our contrastive training objectives, GTAM has demonstrated state-of-the-art performance in downstream tasks for both 2D graphs and 3D conformations: our model performs best on 22 out of 28 tasks compared with previous methods.

Preliminaries
We employ $G_{\{2D,3D\}} = (V_{\{2D,3D\}}, E_{\{2D,3D\}})$ to represent the 2D topological graph and 3D geometric conformation of a molecule. In the 2D topological graph, atoms and chemical bonds are respectively represented as nodes $u \in V_{2D}$ and edges $e \in E_{2D}$. In the 3D geometric conformation, each atom is denoted as a node in the set $V_{3D}$, and we construct a fully connected graph with the distance map comprising the edges $E_{3D}$. More details of molecular featurization are given in Supplementary Appendix A.1.

Contrastive framework
To obtain powerful molecular representations from 2D graphs and 3D conformations, GTAM adopts a contrastive learning strategy that maximizes the mutual information (MI) between the two modalities, as in previous work (Stärk et al. 2022, Liu et al. 2023a) (Fig. 1). It also aims to maximize the conditional generative probability of 3D molecular representations given their corresponding 2D representations, and vice versa. The maximization of MI is reformulated as the summation of two conditional log-likelihoods (Equation 1). Here, we use $T$ and $C$ for the 2D and 3D graphs for notational simplicity, i.e. $T \triangleq G_{2D}$ and $C \triangleq G_{3D}$. GTAM leverages a diffusion generative model to approximate these conditional probabilities, along with two novel molecular encoders for 2D graphs and 3D conformations. We introduce an additional mechanism for maximizing MI, named GTAInfomax.
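The link between MI and the two conditional log-likelihoods can be made explicit with a standard identity, sketched below. The entropy terms $H(T)$ and $H(C)$ do not depend on the conditional models, so maximizing the bracketed expectation is what remains. The name $L_{MI}$ for that expectation and the factor $1/2$ are assumptions here, chosen to match the description in the text rather than taken from it.

```latex
I(T; C)
  = \frac{1}{2}\,\mathbb{E}_{p(T, C)}\big[\log p(C \mid T) + \log p(T \mid C)\big]
  + \frac{1}{2}\big(H(T) + H(C)\big),
\qquad
L_{MI} \;\triangleq\; \frac{1}{2}\,\mathbb{E}_{p(T, C)}\big[\log p(C \mid T) + \log p(T \mid C)\big].
```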
The training objective of GTAInfomax aims to maximize the similarity between paired 2D and 3D molecular representations and to minimize the similarity between unpaired representations within the same batch. The details of GTAInfomax are documented in Supplementary Appendix A.2.
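A minimal sketch of a batch-wise contrastive objective of this kind is given below. It assumes a symmetric InfoNCE-style loss on cosine similarities; the exact GTAInfomax formulation is in Supplementary Appendix A.2 and may differ, and the function name `info_nce` and the `temperature` value are illustrative choices.

```python
import numpy as np

def info_nce(h_2d, h_3d, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired 2D/3D molecule embeddings.

    h_2d, h_3d: arrays of shape (B, d); row i of each belongs to molecule i.
    The loss form and temperature are assumptions, not the paper's settings.
    """
    a = h_2d / np.linalg.norm(h_2d, axis=1, keepdims=True)
    b = h_3d / np.linalg.norm(h_3d, axis=1, keepdims=True)
    logits = a @ b.T / temperature        # (B, B); the diagonal holds the pairs

    def diagonal_cross_entropy(m):
        m = m - m.max(axis=1, keepdims=True)            # numerical stability
        log_prob = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the 2D-to-3D and 3D-to-2D directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

Paired rows are pulled together because they sit on the diagonal of the similarity matrix, while every other molecule in the batch acts as a negative.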

Molecular encoder
Our approach incorporates the geometric triangle awareness update mechanism within both the 2D graph encoders (GTA-2D) and 3D conformation encoders (GTA-3D). This module includes three updating components: node-to-edge, edge-to-edge, and edge-to-node, which capture comprehensive interrelated information among the elements of a molecular graph. The following sections are divided into two parts: first, we describe the detailed implementation of these components, and then we elaborate on our molecular graph encoders for 2D graphs and 3D conformations.

Node to edge
First, we employ the node embedding $h_u$ to update the edge embedding $z_{uv}$, with the updating process given in Equation (2).
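One common way to realize a node-to-edge update is sketched below. It assumes that each edge embedding receives linear projections of its two endpoint embeddings; the projection matrices `w_src` and `w_dst` are hypothetical, and the actual form of Equation (2) may differ.

```python
import numpy as np

def node_to_edge(h, z, w_src, w_dst):
    """Add projected endpoint embeddings to every edge embedding.

    h: (N, d_node) node embeddings; z: (N, N, d_edge) edge embeddings.
    w_src, w_dst: (d_node, d_edge) hypothetical projection matrices.
    """
    src = h @ w_src                       # contribution of the first endpoint u
    dst = h @ w_dst                       # contribution of the second endpoint v
    # Broadcasting gives out[u, v] = z[u, v] + src[u] + dst[v].
    return z + src[:, None, :] + dst[None, :, :]
```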

Edge to edge
To capture the intricate interdependence between edges within the molecular graph, we use two forms of triangle update. The first directly updates the third edge of a triangle using the information of its two adjacent edges. The second updates the edge through an attention mechanism.
In the first form, we update the edge embedding $z_{uv}$ through a linear layer (Equation 5). In the second form, the function $f_2$ projects the edge embedding $z_{uv}$ through different linear layers to obtain the query, key, and value $q^e_{uv}$, $k^e_{uv}$, $v^e_{uv}$ of the attention mechanism and the bias $b_{uv}$ of the triangle update.
Subsequently, the function $\phi_2$ performs the update through an attention-based approach. Unlike standard attention, the computation of the attention scores incorporates the information of the third edge as a bias term (Equation 7).
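The two forms can be sketched as follows, assuming AlphaFold2-style triangle updates on a dense edge tensor. The weight matrices, the residual connections, and the choice of which edge supplies the bias are assumptions for illustration, not the paper's Equations (5) and (7).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def triangle_linear_update(z, w_a, w_b, w_out):
    """First form: edge (u, v) is updated from the edge pairs (u, w) and (v, w)."""
    a = z @ w_a                                   # (N, N, c)
    b = z @ w_b                                   # (N, N, c)
    tri = np.einsum("uwc,vwc->uvc", a, b)         # sum over the third node w
    return z + tri @ w_out                        # linear layer plus residual

def triangle_attention_update(z, w_q, w_k, w_v, w_b):
    """Second form: edge (u, v) attends over edges (u, w), biased by edge (v, w)."""
    q, k, v = z @ w_q, z @ w_k, z @ w_v           # queries, keys, values per edge
    bias = (z @ w_b)[..., 0]                      # (N, N): one scalar per edge
    scores = np.einsum("uvc,uwc->uvw", q, k) / np.sqrt(q.shape[-1])
    scores = scores + bias[None, :, :]            # adds bias[v, w], the third edge
    attn = softmax(scores, axis=-1)               # normalize over the third node w
    return z + np.einsum("uvw,uwc->uvc", attn, v)
```

Here `z` has shape (N, N, d), `w_a`, `w_b`, `w_q`, and `w_k` have shape (d, c), `w_out` has shape (c, d), `w_v` has shape (d, d), and `w_b` in the attention form has shape (d, 1). Both functions touch every triple of nodes, which is why the cost grows cubically with the number of atoms.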

Edge to node
To update node information based on the distance or chemical bond information carried by the edges, we then apply edge-to-node updates.
We aggregate the node information and edge information separately along rows and columns.
After that, we update the resulting normalized node information through a three-layer multi-layer perceptron. An analysis of the time complexity of the geometric triangle awareness update mechanism, covering its computational efficiency and scalability, is provided in Supplementary Appendix A.3.
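The edge-to-node step described above might look like the following sketch. Mean aggregation, the projection `w_edge`, the parameter-free layer normalization, and the ReLU activations are all assumptions made to obtain a runnable example.

```python
import numpy as np

def edge_to_node(h, z, w_edge, mlp_weights):
    """Aggregate edge embeddings over rows and columns, then refine the nodes.

    h: (N, d_node); z: (N, N, d_edge); w_edge: (d_edge, d_node).
    mlp_weights: list of three (d_node, d_node) matrices (hypothetical sizes).
    """
    row = z.mean(axis=1)                  # aggregate z[u, :] for each node u
    col = z.mean(axis=0)                  # aggregate z[:, v] for each node v
    x = h + (row + col) @ w_edge          # fold edge information into the nodes
    # Layer normalization without learned scale or shift.
    x = (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + 1e-6)
    for i, w in enumerate(mlp_weights):   # three matrices give a three-layer MLP
        x = x @ w
        if i < len(mlp_weights) - 1:
            x = np.maximum(x, 0.0)        # ReLU between layers
    return x
```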
To enhance molecular encoding through the integration of geometric triangle awareness updates, we employ a tailored approach for both 2D and 3D molecular structures. In 2D graphs, GTA-2D initially processes nodes and edges using two layers of the graph isomorphism network (Xu et al. 2018). These initial layers enable us to obtain fundamental embeddings for nodes and edges, capturing two-hop information within the graphs. We then apply our geometric triangle awareness update method in the 2D graphs. This method efficiently captures global edge information and updates edge embeddings with other edges. The node and edge embeddings are denoted as $h^{2D}_u$ and $z^{2D}_{uv}$. Regarding 3D conformations, GTA-3D first encodes atom coordinates and nuclear charges with a layer of cfconv (Schütt et al. 2017) to establish basic embeddings. Following this, we implement the geometric triangle awareness update within a fully connected graph for 3D conformations. Owing to the fully connected structure, our update method can rapidly capture all atom-atom interactions and incorporate these many-body effects into both node and edge embeddings, which are denoted as $h^{3D}_u$ and $z^{3D}_{uv}$.

Diffusion processes
To effectively estimate the conditional probabilities $\log p(C \mid T)$ and $\log p(T \mid C)$, we utilize diffusion generative models, which are adept at capturing the complex transition from 2D graphs, denoted as $T = G_{2D}$, to 3D conformations, represented by $C = G_{3D}$, and vice versa.
For the 2D graph $T$ to 3D conformation $C$ generation task and for the 3D conformation $C$ to 2D graph $T$ generation task, the forward process is in each case described by an SDE, where $f_1, f_2, f_3$ and $g_1, g_2, g_3$ denote the drift and diffusion coefficients and $w_t, w^1_t, w^2_t$ represent independent Brownian motions. We set these hyperparameters and the score networks for the diffusion process as in previous work (Liu et al. 2023a).
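To illustrate what a forward SDE of the generic form $dx = f(x, t)\,dt + g(t)\,dw_t$ does to its input, the sketch below simulates it with the Euler-Maruyama scheme. The variance-preserving drift and the constant `beta` are assumed for the example only; the actual coefficients follow Liu et al. (2023a) and are not reproduced here.

```python
import numpy as np

def forward_sde(x0, drift, diffusion, n_steps=1000, t_end=1.0, seed=0):
    """Euler-Maruyama simulation of dx = drift(x, t) dt + diffusion(t) dw."""
    rng = np.random.default_rng(seed)
    dt = t_end / n_steps
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        noise = rng.normal(size=x.shape) * np.sqrt(dt)   # Brownian increment
        x = x + drift(x, t) * dt + diffusion(t) * noise
    return x

# Assumed variance-preserving coefficients: f = -0.5 * beta * x, g = sqrt(beta).
beta = 5.0
coords = np.random.default_rng(1).normal(size=(2000, 3)) * 3.0 + 10.0
noised = forward_sde(coords, lambda x, t: -0.5 * beta * x, lambda t: np.sqrt(beta))
```

With these coefficients the signal is damped exponentially while noise is injected, so the output approaches a standard normal distribution regardless of the starting coordinates. A score network is then trained to reverse this corruption.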
We utilize GTA-2D and GTA-3D as our conditional encoders to encode 2D graphs and 3D conformations.
For the 2D graphs $G_{2D}$ to 3D conformations $G_{3D}$ generation task, the score network

Training objectives
To maximize the MI objective $L_{MI}$, GTAM has several pretraining tasks: (i) contrastive learning between 2D and 3D molecular representations; (ii) generation of 3D conformations based on 2D graphs, to create 3D molecular structures from 2D graphical inputs; (iii) generation of 2D graphs conditional upon 3D conformations, emphasizing the derivation of 2D topological graphs from 3D geometric conformations.
First, the contrastive objective is formulated in terms of $p_t(T_t \mid T, C)$ and $p_t(C_t \mid C, T)$, the noise distributions during the diffusion process, and the sigmoid function $\sigma$.
Expanding upon the foundational concept of contrastive learning of node embeddings, we incorporate two new contrastive training objectives specifically designed to enhance the transfer and integration of edge information across 2D graphs and 3D conformations. These objectives, the contrastive loss for 3D edges $L_{3Dedge}$ and the contrastive loss for 2D edges $L_{2Dedge}$, together form the contrastive objective for edge information. They are designed to enhance the fidelity and relevance of edge embeddings across different molecular representations by implementing direct edge constraints.
For the contrastive loss for the edges in 2D graphs, we employ the 3D conformation distance map as a label and project the edge embeddings from the 2D graphs to predict this distance map. We discretize the distance between atoms $u$ and $v$ in 3D conformations into 32 bins ranging from 1.0 Å to 2.4 Å, a range determined from estimates derived from our pretraining dataset, and encode the bins as one-hot vectors $d^{3D}_{uv}$. The edge embeddings are then projected into these 32 distance bins to obtain bin probabilities $p^{2D}_{uv}$ using softmax. The loss is computed from $p^{2D}_{uv}$ and $d^{3D}_{uv}$.
For the contrastive loss for the edges in 3D conformations, the chemical bond attributes from 2D graphs are used as labels $y^{2D}_{uv}$, with an additional label for the absence of a chemical bond. We project the pair embeddings $z^{3D}_{uv}$ from 3D conformations to predict these labels, employing a cross-entropy loss for the classification task.
The second objective is the conditional generation from 2D graphs to 3D conformations. The goal is to use $S^{2D \to 3D}_{\theta}$ to estimate the score $\nabla_{C_t} \log p_t(C_t \mid C, T)$. The training objective for learning $p(C \mid T)$ is based on this score network.
The third objective is reconstructing the 2D graphs from 3D conformations, $p(T \mid C)$, for which a score network is again used to define the training objective.
The overall training objective of GTAM combines the objectives described above.
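The edge-level supervision can be sketched as below. The bin count and range come from the text, while equal-width bins, the clipping of out-of-range distances into the end bins, and the use of cross-entropy for the distance bins are assumptions.

```python
import numpy as np

N_BINS, D_MIN, D_MAX = 32, 1.0, 2.4       # values stated in the text (in Å)

def distance_bins(coords):
    """Map pairwise distances to integer bin indices in [0, N_BINS - 1]."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # 31 interior boundaries of 32 equal-width bins (equal width is assumed).
    edges = np.linspace(D_MIN, D_MAX, N_BINS + 1)[1:-1]
    # Distances below D_MIN or above D_MAX fall into the first or last bin.
    return np.digitize(dist, edges)

def edge_cross_entropy(logits, labels, mask):
    """Mean cross-entropy over masked edges; logits (N, N, K), labels (N, N)."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_p, labels[..., None], axis=-1)[..., 0]
    return -(picked * mask).sum() / mask.sum()
```

The same `edge_cross_entropy` helper covers the 3D edge loss: there `logits` would come from the 3D pair embeddings and `labels` would hold the bond types, with one extra class for atom pairs that share no bond.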

Experiments
We pre-trained our model on PCQM4Mv2, where molecules are represented by both 2D graphs and corresponding 3D conformations. We evaluated on 28 distinct tasks to ascertain the effectiveness of our method, categorized as follows: (i) eight molecular property tasks utilizing solely 2D topological graphs in the MoleculeNet dataset; (ii) 20 regression tasks based on 3D geometric conformations in the QM9 and MD17 datasets. We also use molecular retrieval tasks to validate our model's practical application. Finally, we conduct extensive ablation studies to investigate the impact of geometric triangle awareness update methods. A detailed description of the experiments and the datasets can be found in Supplementary Appendices A and B.

Dataset
The PCQM4Mv2 dataset has 3.38M molecules in total. It is a quantum chemistry dataset originally curated under the PubChemQC (Nakata and Shimazaki 2017) project with both 2D topological graphs and 3D geometric conformations. MoleculeNet (Wu et al. 2018) is a large-scale benchmark for molecular machine learning, which curates multiple public datasets and establishes metrics for evaluation. We choose eight binary classification tasks on MoleculeNet: BACE, BBBP, ClinTox, HIV, SIDER, Tox21, ToxCast, and MUV. Most of them have limited data. We choose the ToxCast dataset in MoleculeNet to conduct molecular retrieval experiments.
MD17 (Chmiela et al. 2017) is a dataset for molecular dynamics simulations. It comprises eight different organic molecules, each corresponding to a specific task. The aim is to predict the energy-conserving forces for each atom in the molecule. QM9 (Ramakrishnan et al. 2014) is a subset of the GDB-17 database and comprises 134 thousand stable organic molecules with up to nine heavy atoms. It has 12 tasks related to quantum properties.
For 3D conformation tasks, we follow the baselines from GeoSSL (Liu et al. 2022). That study introduces four coordinate-MI-unaware SSL baselines, two contrastive SSL baselines, and one generative SSL baseline. These include predicting three key molecular aspects: identifying hidden atom types, estimating distances between atom pairs, and calculating angles among atom triplets. 3D InfoGraph discerns whether node-level and graph-level 3D representations correspond to the same molecule. We utilize three distinct objective functions, RR, InfoNCE, and EBM-NCE, to maximize the MI between a conformation and its augmented counterpart. We also evaluate our method by comparing it with several notable models, including GraphMVP, 3D InfoMax, the method of Zaidi et al. (2022), GeoSSL, and MoleculeSDE. For these datasets, we also established no-pretraining baselines, i.e. models built with random weight initialization, to assess the effectiveness of pretraining strategies by comparing model performance before and after pretraining.

Results on 2D graph tasks
For each task, we split the dataset into training, validation, and test sets with an 8:1:1 ratio using the scaffold split, which groups molecules by their molecular scaffolds. The results of the classification tasks on MoleculeNet are summarized in Table 1. We can observe that our work demonstrates superior performance in molecular property prediction when compared to previous methods.
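A scaffold split keeps all molecules that share a scaffold in the same subset, so the test set contains scaffolds unseen during training. The sketch below assumes the scaffold keys have already been computed by a cheminformatics toolkit and uses a greedy largest-group-first assignment; this is a common convention, not necessarily the exact procedure used here.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that no scaffold is shared across subsets.

    scaffolds: one precomputed scaffold key (e.g. a SMILES string) per molecule.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)
    # Largest scaffold groups first, so rare scaffolds tend to land in valid/test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

Because whole groups are assigned at once, the realized proportions only approximate 8:1:1 when a few scaffolds dominate the dataset.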
To validate the effectiveness of our molecular encoders, the results of GTAInfomax are also presented in Table 1; they are better than those of some previous works such as 3DInfomax. Additionally, according to Table 1, some pretraining baselines such as ContextPred and AttrMask perform similarly to or even worse than our no-pretraining baseline, which demonstrates the strength of our molecular encoder in supervised learning. Moreover, GTAM's performance on these datasets exceeded that of our no-pretraining baseline, further illustrating the effectiveness of the pretraining phase in enhancing performance in downstream tasks. These results indicate the effectiveness of our method in extracting molecular feature information and integrating information between the 2D graph and 3D conformation modalities.

Results on 3D conformation tasks
In the MD17 dataset, following previous works (Schütt et al. 2017, Gasteiger et al. 2020a), we use 1K molecules for fine-tuning, 1K for validation, and 48-991K for testing across a variety of tasks. The goal of MD17 is to predict the energy-conserving interatomic forces for each atom at each molecule position. We can observe that GTAM performs best on seven out of eight tasks in Table 2. For the QM9 dataset, we take 110 thousand molecules for training, 10 thousand for validation, and 11 thousand for testing. As shown in Table 3, GTAM reaches the best performance on seven tasks.
We also tested GTAInfomax on the QM9 and MD17 datasets, and it demonstrates superior performance compared to other models. Moreover, compared to our no-pretraining baseline, some pretraining baselines show similar or poorer results on the 20 3D conformation tasks. These results highlight the efficacy of our GTA-3D. More results about GTAM can be found in Supplementary Appendix C.

Results on molecular retrieval
We also conducted molecular retrieval experiments to show the model's ability to obtain representations with chemical significance for practical applications. We conducted the case study on the ToxCast dataset. In this experiment, each molecule's output embedding is a 256-dimensional vector, and we calculate the cosine similarities between the embeddings of all molecules in the test set and that of a designated query molecule. Additionally, we assessed the Tanimoto similarity between the extended-connectivity fingerprints (ECFPs) of the query molecule and those of the molecules in the test set. Figure 2 shows the top four molecules identified by our model and by MoleculeSDE based on cosine similarity to the query molecule. This comparison suggests that GTAM's molecular representations exhibit a higher correlation with molecular fingerprints, demonstrating the model's ability to capture and reflect the intrinsic chemical characteristics of molecules. The results confirm the effectiveness of our model in practical applications, as well as its superior molecular feature extraction capabilities.
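The retrieval procedure reduces to two similarity computations, sketched below. Fingerprints are represented here as sets of on-bit indices, which is an assumption for illustration; in practice the ECFPs would be generated by a cheminformatics toolkit.

```python
import numpy as np

def top_k_by_cosine(query, embeddings, k=4):
    """Indices and similarities of the k embeddings closest to the query."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q                          # cosine similarity to every molecule
    order = np.argsort(-sims)[:k]         # best match first
    return order, sims[order]

def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0
```

For each retrieved molecule, the Tanimoto similarity of its fingerprint to the query fingerprint then gives an embedding-independent check of chemical resemblance.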

Conclusions
In our study, we introduce GTAM, a contrastive learning approach for molecular representation learning. This method not only captures complex edge relationships and the multifaceted nature of molecular structures using a novel molecular encoder but also implements two purpose-designed loss functions to enhance joint learning. Through extensive experiments on eight molecular property prediction tasks and 20 molecular conformation tasks, GTAM has achieved state-of-the-art results in several of these. This demonstrates that GTAM is capable of efficiently extracting molecular feature information and integrating data across the 2D graph and 3D conformation modalities. While these results are encouraging, there are still challenges to address. We will expand our pre-trained model's application to a wider array of scenarios, such as predicting drug-drug and drug-target interactions. This expansion aims to further demonstrate the model's versatility and potential impact in the field of pharmaceutical research.

Key points
• GTAM has achieved state-of-the-art results in both 2D and 3D molecular representation tasks, outperforming existing methods in 22 out of 28 tasks.
• GTAM introduces a new contrastive learning framework for molecular representation, incorporating innovative molecular encoders and a unique geometric triangle awareness mechanism.
• GTAM establishes two new training objectives that integrate edge information supervision during pretraining. This facilitates effective cross-modal information transfer and significantly boosts the model's expressive capabilities.

Figure 1. Overview of GTAM: The GTAM framework integrates a geometric triangle awareness updating mechanism along with innovative training objectives. This approach is designed to enhance molecular representation through a comprehensive understanding of the geometric interrelationships within molecular structures.

Figure 2. The image showcases the query molecule along with the four nearest molecules and their respective extracted representations. Tanimoto similarity scores, depicted beneath each molecule, quantify their chemical resemblance to the query molecule. The highlighted red sections denote the results from MoleculeSDE, while the blue sections represent the results from GTAM.

Table 1. Results on the MoleculeNet dataset with 2D topological graphs only. For each downstream task, we present the mean ROC-AUC (with standard deviation) across three seeds, using scaffold splitting. The best and second-best results are marked in bold and underlined, respectively.

Table 2. Results on eight atomic force prediction tasks on the MD17 dataset (in kcal/mol/Å) with 3D geometric conformations only. The evaluation metric is mean absolute error. The best and second-best results are marked in bold and underlined, respectively.

Table 3. Results on 12 energy predictions on the QM9 dataset using 110K molecules for training with 3D geometric conformations only. The evaluation metric is mean absolute error. The best and second-best results are marked in bold and underlined, respectively.