2.1 Materials
Our approach begins by pretraining on a substantial amount of unlabeled molecular structure data, followed by fine-tuning on a limited set of labeled molecular data for downstream tasks to predict molecular properties. The unlabeled molecular data we utilize is sourced from the ZINC15 dataset, comprising approximately 250 thousand unlabeled molecular instances.
In the downstream tasks, we employed six classification datasets obtained from MoleculeNet21. All molecules in these datasets are represented as SMILES strings. We utilized the open-source cheminformatics software RDKit22 to parse the SMILES strings and transform them into the molecular graph and molecular fingerprint representations required by our model. For data partitioning we employed scaffold splitting23, which partitions a dataset based on molecular scaffolds; all molecular datasets were split into training, validation, and test sets in a ratio of 8:1:1. Details of the six datasets and a summary of the ZINC15 dataset are provided in Table 1. Below, we describe the molecular properties targeted by each of these downstream fine-tuning datasets.
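As a minimal illustration of this conversion step, the sketch below uses RDKit to parse a SMILES string into a simple graph (atom and bond index lists) and a Morgan fingerprint. The exact atom/bond features consumed by our encoder and the fingerprint size are not specified here, so these particular choices are placeholders.

```python
# Minimal sketch of the SMILES-to-representation step; the feature set used by the
# actual GIN encoder may differ, this only illustrates the RDKit conversion.
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_graph_and_fingerprint(smiles, radius=1, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                      # invalid SMILES string
        return None
    # Node list: one entry per atom (atomic number as a simple node feature).
    atoms = [atom.GetAtomicNum() for atom in mol.GetAtoms()]
    # Edge list: undirected bonds stored as atom-index pairs.
    bonds = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
    # Morgan (circular) fingerprint as a 0/1 bit vector.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return atoms, bonds, list(fp)

atoms, bonds, fp = smiles_to_graph_and_fingerprint("CCO")  # ethanol as a toy example
```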
BBBP dataset24: The dataset is specifically designed for modeling and predicting blood-brain barrier permeability. It includes labels indicating whether compounds can penetrate the blood-brain barrier.
BACE dataset25: The dataset provides quantitative (IC50) and qualitative (binary labels) binding results for a set of human β-secretase 1 (BACE-1) inhibitors.
ClinTox dataset26,27: The dataset comprises qualitative data on drugs approved by the United States Food and Drug Administration (FDA) and drugs that did not pass clinical trials due to toxicity reasons.
SIDER dataset28: The dataset contains marketed drugs and their recorded adverse drug reactions, extracted from public documents and package inserts. Available information includes the frequency of side effects, drug and side-effect classifications, and links to further information such as drug-target relationships.
Tox21 dataset29: Tox21 (Toxicology in the 21st Century) is a public database that measures the toxicity of compounds. It includes toxicity measurements for roughly 8,000 compounds against 12 different targets.
ToxCast dataset21: Extended data collection from the same initiative as Tox21, involving toxicology data based on high-throughput screening of a large compound library. This dataset includes experimental data for over 600 tasks.
Table 1
Basic Information of ZINC15 and 6 Benchmark Datasets
Dataset | Category | Number of molecules |
BBBP | Physiology and toxicity | 2,039 |
BACE | Bioactivity and biophysics | 1,513 |
ClinTox | Physiology and toxicity | 1,478 |
SIDER | Physiology and toxicity | 1,427 |
Tox21 | Physiology and toxicity | 7,831 |
ToxCast | Physiology and toxicity | 8,615 |
ZINC15 | / | 250,000 |
2.2 GFC framework architecture
2.2.1 Overview
This framework consists of two phases: a pretraining phase and a downstream fine-tuning phase. In the pretraining phase, we leverage unlabeled molecular data for unsupervised pretraining. Unlabeled molecules in SMILES format are transformed into molecular graph representations and input into our graph neural network to obtain graph embedding representations. These graph embeddings are then clustered to obtain cluster center molecules. We subsequently perform contrastive learning using the pretraining data together with the graph representations and molecular fingerprints of the newly obtained cluster center molecules. The second phase is fine-tuning on downstream tasks. Once contrastive pretraining is complete, the trained model weights are used for fine-tuning on labeled molecular datasets for downstream tasks. By connecting the pretrained network to fully connected layers and activation layers, we ultimately obtain the classification results.
2.2.3 Obtaining cluster center molecules
In this framework, the SMILES representations of molecules from the unlabeled dataset are transformed into a graph format using RDKit. Before pretraining begins, the graph neural network is used to obtain an embedding representation for each molecule, denoted as \({m}_{i}=({x}_{1},{x}_{2},\dots ,{x}_{n})\). The K-Means algorithm is then applied to these molecular embeddings to produce \(k\) clusters, each of which yields the representation and index of a cluster center molecule, denoted as \({m}_{c1},{m}_{c2},\dots ,{m}_{ck}\). We utilize t-SNE30 for dimensionality reduction to visualize the clusters, as depicted in Fig. 1. The process flowchart for obtaining cluster center molecules is illustrated in Fig. 2.
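A minimal sketch of this clustering step is given below, using scikit-learn's K-Means on placeholder embeddings. Whether the cluster center is the centroid vector itself or the nearest actual molecule is an implementation detail not fixed by the text; the sketch assumes the nearest molecule.

```python
# Sketch of the cluster-center step; 'embeddings' stands in for the GNN outputs
# (dimension 300 as in Section 2.4), and k is a hyperparameter.
import numpy as np
from sklearn.cluster import KMeans

embeddings = np.random.rand(10_000, 300).astype(np.float32)  # placeholder GNN embeddings
k = 10

kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

# For every cluster, take the molecule closest to its centroid as the cluster center molecule.
center_indices = []
for c in range(k):
    members = np.where(kmeans.labels_ == c)[0]                 # global indices of molecules in cluster c
    dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
    center_indices.append(int(members[np.argmin(dists)]))
center_embeddings = embeddings[center_indices]                  # m_c1, ..., m_ck
```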
2.2.3 Similarity calculation of molecular representation
Next, we compute the similarity between each pretraining molecule and the cluster center molecules in two ways: between their molecular fingerprints and between their graph embeddings obtained through the graph neural network. These fingerprint and graph-embedding similarity scores are then used for contrastive learning. For the fingerprint similarity, we use the Tanimoto coefficient to calculate the pairwise similarity between molecular fingerprints, as given in Eq. (1). Here, a molecular fingerprint is a binary vector of 0s and 1s, where a denotes the number of 1s in the compared molecular fingerprint, b the number of 1s in the reference molecular fingerprint, and c the number of positions where both fingerprints have a value of 1.
$$\begin{array}{c}{T}_{st}=\frac{c}{a+b-c}\#\left(1\right)\end{array}$$
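The sketch below computes Eq. (1) directly from two 0/1 bit vectors; for RDKit bit vectors, the built-in rdkit.DataStructs.TanimotoSimilarity yields the same value.

```python
def tanimoto(fp_query, fp_ref):
    """Tanimoto coefficient of Eq. (1) for two 0/1 bit vectors of equal length."""
    a = sum(fp_query)                                  # number of 1-bits in the compared fingerprint
    b = sum(fp_ref)                                    # number of 1-bits in the reference fingerprint
    c = sum(x & y for x, y in zip(fp_query, fp_ref))   # positions where both fingerprints are 1
    return c / (a + b - c) if (a + b - c) > 0 else 0.0
```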
Since the Tanimoto similarity satisfies \({T}_{st}\in \left[\text{0,1}\right]\), we want the similarity computed between molecular embeddings obtained through the graph neural network to fall within approximately the same range. To meet this constraint, we design the following similarity computation for graph embeddings. The first step is to obtain the sub-maximal spatial vector distance, which is computed while the molecular graphs are processed by the graph neural network, as described in the preceding section. In each mini-batch, we take the element-wise absolute differences between the embeddings of each pair of molecules and sum them, giving the pairwise spatial (L1) distances; the largest of these within the mini-batch is taken as the sub-maximal spatial vector distance, denoted \(disma{x}^{*}\). Because it is computed within a mini-batch, this distance may not be the maximum over the entire dataset, hence its designation as the sub-maximal spatial vector distance. The formula for calculating the similarity score from graph embeddings is as follows:
$$sm{i}_{G}=\frac{\sum |{m}_{i}-{m}_{ci}|}{disma{x}^{*}}$$
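A minimal sketch of this computation for one mini-batch is shown below. Note that, as written, the score is an L1 distance normalized by \(disma{x}^{*}\); the tensor shapes and function names are assumptions for illustration.

```python
import torch

def graph_embedding_similarity(m, m_c):
    """m: [B, d] mini-batch embeddings; m_c: [k, d] cluster-center embeddings."""
    # Sub-maximal spatial vector distance: largest pairwise L1 distance inside the mini-batch.
    dismax = torch.cdist(m, m, p=1).max()
    # L1 distance between each molecule and each cluster center, normalized by dismax,
    # following the formula as written in the text.
    return torch.cdist(m, m_c, p=1) / dismax
```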
2.2.4 Contrastive Learning of Molecular Fingerprints and Molecular Graph Embeddings
To thoroughly extract structural information in an unsupervised setting, we employ a contrastive learning approach. We posit that pretraining methods built on pseudo-labels, such as masking atom-type or bond features in molecules and predicting them during training, may compromise the semantic information of chemical molecules. At the same time, molecular fingerprints are an effective way to convert the semantic information of a chemical molecule into a vector representation. Morgan fingerprints generate bit-vector representations from the local neighborhoods in a molecule; when the radius of this local scope is set to 1, what is effectively captured is the local structural information of the molecule. In the context of Morgan fingerprints, the radius refers to the number of chemical bonds extended outward from each atom; it is not a physical distance but the number of bond hops between atoms. This choice is more conducive to extracting the topological structure information of molecules31.
After computing the similarity between the graph embeddings of the molecule to be pretrained and those of the cluster center molecules, as well as the corresponding fingerprint similarities, we align the graph-embedding similarity with the molecular-fingerprint similarity through loss optimization, using the Mean Squared Error (MSE) loss to compute and minimize the loss value. Through training iterations, we ultimately obtain the graph embedding representation of the molecule.
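The sketch below illustrates one way to realize this objective with PyTorch's MSELoss, treating the fingerprint (Tanimoto) similarities as targets for the graph-embedding scores. Exactly how the two scores are paired (for example, whether the normalized graph distance is first converted into a similarity) is not specified above, so this pairing is an assumption.

```python
import torch.nn as nn

mse = nn.MSELoss()

def pretraining_loss(graph_sim, fingerprint_sim):
    """graph_sim, fingerprint_sim: [B, k] scores against the k cluster center molecules."""
    # The fingerprint similarities act as fixed targets; gradients flow only through
    # the graph-side scores produced by the GNN.
    return mse(graph_sim, fingerprint_sim.detach())
```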
2.2.5 Graph Self-Attention Pool
Attention mechanisms now play a crucial role in artificial intelligence and appear across many neural network modules. In our work, we apply the self-attention mechanism to the graph pooling process. When computing self-attention, the input features are mapped into a query matrix Q, a key matrix K, and a value matrix V through trainable parameters32. In graph pooling, we consider that the graph representation produced at each layer of the graph neural network may influence the final graph representation. Therefore, instead of the conventional choices of taking only the last layer's representation after feature extraction, or simply summing all layers without weights, we use the self-attention mechanism to selectively aggregate the representation from each layer into a comprehensive graph representation, which is subsequently used for fine-tuning on downstream tasks.
During pretraining, we employ a graph neural network. In its readout phase, one typically either uses the feature representation from the last update step for the pooling operation or sums the feature representations from all update steps. Due to the black-box nature of neural networks, it is difficult to determine which layer's representation contributes most to the final result. Therefore, after the aggregation operation at each layer, we introduce self-attention coefficients and leverage the fitting capability of the neural network to find the most suitable graph embedding representation \({h}_{G}\), where \({h}_{0},{h}_{1},\dots ,{h}_{m}\) denote the layer-wise graph representations. The formulas are as follows:
$$\begin{array}{c}{h}_{G}=pool\left({h}_{0},{h}_{1},\dots ,{h}_{m}\right)\#\left(7\right)\end{array}$$
$$\begin{array}{c}pool\left({h}_{0},\dots ,{h}_{m}\right)={\sum }_{l=0}^{m}softmax\left(\frac{{\left(W{h}_{l}\right)}^{T}\left(W{h}_{l}\right)}{\sqrt{{d}_{h}}}\right){h}_{l}\#\left(8\right)\end{array}$$
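As a concrete illustration of Eqs. (7)-(8), the sketch below implements this self-attention readout in PyTorch. It uses the single projection matrix W exactly as written in Eq. (8); module and variable names are illustrative rather than taken from our implementation.

```python
import torch
import torch.nn as nn

class SelfAttentionPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # trainable projection W in Eq. (8)
        self.d_h = dim

    def forward(self, layer_reprs):
        """layer_reprs: list of [batch, dim] tensors, one graph representation per GNN layer."""
        h = torch.stack(layer_reprs, dim=1)                    # [batch, m+1, dim]
        wh = self.W(h)                                         # [batch, m+1, dim]
        scores = (wh * wh).sum(dim=-1) / (self.d_h ** 0.5)     # (W h_l)^T (W h_l) / sqrt(d_h)
        alpha = torch.softmax(scores, dim=1)                   # attention weight per layer
        return (alpha.unsqueeze(-1) * h).sum(dim=1)            # weighted sum -> final graph embedding
```

The softmax is taken over the layer dimension, so each layer's representation receives a learned weight rather than being discarded or summed uniformly.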
2.3 Baseline
To showcase the performance of our designed model, we compare it with several other methods, including those that also undergo downstream task fine-tuning after pre-training. The benchmark models for comparison in this paper are as follows:
MoCL33: Combines two contrastive strategies: a local-level strategy that contrasts representations encoded from two graph augmentations, and a global-level strategy that contrasts the mutual information between similar graphs.
MolCLR18: Employs three molecular graph augmentations, namely atom masking, bond deletion, and subgraph removal, and compares the effect of these different augmentation methods.
GraphCL34: This method generates two L-hop subgraphs for a node with random perturbations and employs self-supervised learning by maximizing the similarity between the two generated subgraphs.
JOAO35: Designs an automatic augmentation-selection framework on top of Graph Contrastive Learning (GraphCL). The overall idea is to iteratively train the various data augmentation methods through adversarial training to obtain a probability matrix, and to replace the projection head in GraphCL accordingly.
PretrainGNN36: Pre-trains the Graph Neural Network (GNN) encoder using context graphs of nodes predicted from the neighborhood structure, together with molecular biomedical measurements.
GCC37: Uses subgraph instance discrimination as the pretext task and performs contrastive learning to capture structural patterns that are universal and transferable.
GPT-GNN38: Introduces a generative pre-training framework that factorizes graph generation into node attribute generation and edge generation, allowing the GNN to capture the structural and semantic properties of graphs.
GROVER39: Integrates message-passing networks into the Transformer architecture and pre-trains a model with 100 million parameters on a dataset of 10 million molecules.
MGSSL40: Introduces a motif-based generative self-supervised learning (SSL) framework that pre-trains a Graph Neural Network (GNN). Motif construction rules are refined, the GNN backbone encodes molecular graph representations, and motifs are predicted in a given order over the graph using either depth-first or breadth-first search.
GraphLoG41: Aligns the embeddings of correlated subgraphs to preserve local similarity and introduces hierarchical prototypes to capture the global semantic structure.
TopExpert42: Learns topology-specific expert models for predicting molecular properties.
HiMol43: Proposes a hierarchical molecular graph neural network that considers both the molecular structure and information from molecular substructures for predicting molecular properties.
2.4 Evaluation Metrics and Experimental Parameters
Because the datasets used in our work are classification datasets, we employ the Area Under the Receiver Operating Characteristic Curve (ROC-AUC) as the evaluation metric for experimental performance; a larger ROC-AUC value indicates better performance.
All experiments were conducted on a Linux server equipped with an Nvidia A100 GPU and an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz. During the pretraining phase, our pretrained model was trained for 100 epochs using the MSELoss and the Adam optimizer with a learning rate of 0.001. The model architecture used a 5-layer GIN with a dropout rate of 0.5. The batch size was set to 512 with 4 threads, and the dimension of the graph embedding was set to 300.
In the downstream tasks, our classifier consists of two fully connected layers and one ELU activation layer. The weights of both the pretrained model and the fine-tuned downstream models were optimized with the Adam optimizer; however, the downstream tasks used a different loss function, namely BCEWithLogitsLoss, and the learning rate was set individually for each dataset. The batch size was kept at 512 with 4 threads, and each downstream dataset was fine-tuned for 200 epochs. Detailed parameter settings for each downstream task dataset are presented in the table.
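For concreteness, the sketch below shows a downstream classification head consistent with this description (two fully connected layers with an ELU activation, trained with BCEWithLogitsLoss and Adam). The hidden width, learning rate, and number of tasks are illustrative placeholders rather than the per-dataset settings.

```python
import torch
import torch.nn as nn

class FineTuneHead(nn.Module):
    def __init__(self, emb_dim=300, hidden_dim=300, num_tasks=1):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, num_tasks),   # one logit per classification task
        )

    def forward(self, graph_emb):
        # graph_emb: [batch, emb_dim] pretrained graph embedding
        return self.classifier(graph_emb)

head = FineTuneHead(num_tasks=12)                           # e.g., 12 Tox21 targets
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)    # learning rate is set per dataset
```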