2.1 Materials
Our approach begins by pretraining on a substantial amount of unlabeled molecular structure data, followed by fine-tuning on a limited set of labeled molecular data for downstream tasks to predict molecular properties. The unlabeled molecular data we utilize is sourced from the ZINC15 dataset, comprising approximately 250 thousand unlabeled molecular instances.
In the downstream tasks, we employed six classification datasets obtained from MoleculeNet21. All molecules in these datasets are represented as SMILES strings. We utilized the open-source cheminformatics software RDKit22 to parse the SMILES strings and transform them into the molecular graph and molecular fingerprint representations required by our model. For data partitioning we employed scaffold splitting23, which partitions a dataset based on molecular scaffolds; all molecular datasets were split into training, validation, and test sets in a ratio of 8:1:1. Details of the six datasets and a summary of the ZINC15 dataset are provided in Table 1. Below, we describe the molecular properties targeted by each of these downstream fine-tuning datasets.
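As a minimal illustration of this conversion step, the sketch below uses RDKit to parse a SMILES string into a simple graph (atom and bond index lists) and a Morgan fingerprint. The exact atom/bond features consumed by our encoder and the fingerprint size are not specified here, so these particular choices are placeholders.

```python
# Minimal sketch of the SMILES-to-representation step; the feature set used by the
# actual GIN encoder may differ, this only illustrates the RDKit conversion.
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_graph_and_fingerprint(smiles, radius=1, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                      # invalid SMILES string
        return None
    # Node list: one entry per atom (atomic number as a simple node feature).
    atoms = [atom.GetAtomicNum() for atom in mol.GetAtoms()]
    # Edge list: undirected bonds stored as atom-index pairs.
    bonds = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
    # Morgan (circular) fingerprint as a 0/1 bit vector.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return atoms, bonds, list(fp)

atoms, bonds, fp = smiles_to_graph_and_fingerprint("CCO")  # ethanol as a toy example
```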
BBBP dataset24: The dataset is specifically designed for modeling and predicting blood-brain barrier permeability. It includes labels indicating whether compounds can penetrate the blood-brain barrier.
BACE dataset25: The dataset provides quantitative (IC50) and qualitative (binary labels) binding results for a set of human β-secretase 1 (BACE-1) inhibitors.
ClinTox dataset26,27: The dataset comprises qualitative data on drugs approved by the United States Food and Drug Administration (FDA) and drugs that did not pass clinical trials due to toxicity reasons.
SIDER dataset28: The dataset contains marketed drugs and their recorded adverse drug reactions, extracted from public documents and package inserts. Available information includes the frequency of side effects, drug and side-effect classifications, and links to further information such as drug-target relationships.
Tox21 dataset29: Tox21 (Toxicology in the 21st Century) is a public database that measures the toxicity of compounds. It includes toxicity measurements for roughly 8,000 compounds against 12 different targets.
ToxCast dataset21: Extended data collection from the same initiative as Tox21, involving toxicology data based on high-throughput screening of a large compound library. This dataset includes experimental data for over 600 tasks.
Table 1
Basic Information of ZINC15 and 6 Benchmark Datasets
Dataset | Category | Number of molecules |
BBBP | Physiology and toxicity | 2,039 |
BACE | Bioactivity and biophysics | 1,513 |
ClinTox | Physiology and toxicity | 1,478 |
SIDER | Physiology and toxicity | 1,427 |
Tox21 | Physiology and toxicity | 7,831 |
ToxCast | Physiology and toxicity | 8,615 |
ZINC15 | / | 250,000 |
2.2 GFC framework architecture
2.2.1 Overview
This framework consists of two phases: a pretraining phase and a downstream fine-tuning phase. In the pretraining phase, we leverage unlabeled molecular data for unsupervised pretraining. Unlabeled molecules in SMILES format are transformed into molecular graph representations and input into our graph neural network to obtain graph embedding representations. These graph embeddings are then clustered to obtain cluster center molecules. We subsequently perform contrastive learning using the pretraining data together with the graph representations and molecular fingerprints of the newly obtained cluster center molecules. The second phase is fine-tuning on downstream tasks. Once contrastive pretraining is complete, the trained model weights are used for fine-tuning on labeled molecular datasets for downstream tasks. By connecting the pretrained network to fully connected layers and activation layers, we ultimately obtain the classification results.
2.2.3 Obtaining cluster center molecules
In this framework, the SMILES representations of molecules from the unlabeled dataset are transformed into a graph format using RDKit. Before pretraining begins, the graph neural network is used to obtain an embedding representation for each molecule, denoted as \({m}_{i}=({x}_{1},{x}_{2},\dots ,{x}_{n})\). The K-Means algorithm is then applied to these molecular embeddings to produce \(k\) clusters, each of which yields the representation and index of a cluster center molecule, denoted as \({m}_{c1},{m}_{c2},\dots ,{m}_{ck}\). We utilize t-SNE30 for dimensionality reduction to visualize the clusters, as depicted in Fig. 1. The process flowchart for obtaining cluster center molecules is illustrated in Fig. 2.
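A minimal sketch of this clustering step is given below, using scikit-learn's K-Means on placeholder embeddings. Whether the cluster center is the centroid vector itself or the nearest actual molecule is an implementation detail not fixed by the text; the sketch assumes the nearest molecule.

```python
# Sketch of the cluster-center step; 'embeddings' stands in for the GNN outputs
# (dimension 300 as in Section 2.4), and k is a hyperparameter.
import numpy as np
from sklearn.cluster import KMeans

embeddings = np.random.rand(10_000, 300).astype(np.float32)  # placeholder GNN embeddings
k = 10

kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

# For every cluster, take the molecule closest to its centroid as the cluster center molecule.
center_indices = []
for c in range(k):
    members = np.where(kmeans.labels_ == c)[0]                 # global indices of molecules in cluster c
    dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
    center_indices.append(int(members[np.argmin(dists)]))
center_embeddings = embeddings[center_indices]                  # m_c1, ..., m_ck
```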
2.2.3 Similarity calculation of molecular representation
Next, we compute the similarity between each pretraining molecule and the cluster center molecules in two ways: between their molecular fingerprints and between their graph embeddings obtained through the graph neural network. These fingerprint and graph-embedding similarity scores are then used for contrastive learning. For the fingerprint similarity, we use the Tanimoto coefficient to calculate the pairwise similarity between molecular fingerprints, as given in Eq. (1). Here, a molecular fingerprint is a binary vector of 0s and 1s, where a denotes the number of 1s in the compared molecular fingerprint, b the number of 1s in the reference molecular fingerprint, and c the number of positions where both fingerprints have a value of 1.
$$\begin{array}{c}{T}_{st}=\frac{c}{a+b-c}\#\left(1\right)\end{array}$$
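The sketch below computes Eq. (1) directly from two 0/1 bit vectors; for RDKit bit vectors, the built-in rdkit.DataStructs.TanimotoSimilarity yields the same value.

```python
def tanimoto(fp_query, fp_ref):
    """Tanimoto coefficient of Eq. (1) for two 0/1 bit vectors of equal length."""
    a = sum(fp_query)                                  # number of 1-bits in the compared fingerprint
    b = sum(fp_ref)                                    # number of 1-bits in the reference fingerprint
    c = sum(x & y for x, y in zip(fp_query, fp_ref))   # positions where both fingerprints are 1
    return c / (a + b - c) if (a + b - c) > 0 else 0.0
```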
Since the Tanimoto similarity satisfies \({T}_{st}\in \left[\text{0,1}\right]\), we want the similarity computed between molecular embeddings obtained through the graph neural network to fall within approximately the same range. To meet this constraint, we design the following similarity computation for graph embeddings. The first step is to obtain the sub-maximal spatial vector distance, which is computed while the molecular graphs are processed by the graph neural network, as described in the preceding section. In each mini-batch, we take the element-wise absolute differences between the embeddings of each pair of molecules and sum them, giving the pairwise spatial (L1) distances; the largest of these within the mini-batch is taken as the sub-maximal spatial vector distance, denoted \(disma{x}^{*}\). Because it is computed within a mini-batch, this distance may not be the maximum over the entire dataset, hence its designation as the sub-maximal spatial vector distance. The formula for calculating the similarity score from graph embeddings is as follows:
$$sm{i}_{G}=\frac{\sum |{m}_{i}-{m}_{ci}|}{disma{x}^{*}}$$
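A minimal sketch of this computation for one mini-batch is shown below. Note that, as written, the score is an L1 distance normalized by \(disma{x}^{*}\); the tensor shapes and function names are assumptions for illustration.

```python
import torch

def graph_embedding_similarity(m, m_c):
    """m: [B, d] mini-batch embeddings; m_c: [k, d] cluster-center embeddings."""
    # Sub-maximal spatial vector distance: largest pairwise L1 distance inside the mini-batch.
    dismax = torch.cdist(m, m, p=1).max()
    # L1 distance between each molecule and each cluster center, normalized by dismax,
    # following the formula as written in the text.
    return torch.cdist(m, m_c, p=1) / dismax
```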
2.2.4 Contrastive Learning of Molecular Fingerprints and Molecular Graph Embeddings
To thoroughly extract structural information in an unsupervised setting, we employ a contrastive learning approach. We posit that pretraining methods built on pseudo-labels, such as masking atom-type or bond features in molecules and predicting them during training, may compromise the semantic information of chemical molecules. At the same time, molecular fingerprints are an effective way to convert the semantic information of a chemical molecule into a vector representation. Morgan fingerprints generate bit-vector representations from the local neighborhoods in a molecule; when the radius of this local scope is set to 1, what is effectively captured is the local structural information of the molecule. In the context of Morgan fingerprints, the radius refers to the number of chemical bonds extended outward from each atom; it is not a physical distance but the number of bond hops between atoms. This choice is more conducive to extracting the topological structure information of molecules31.
After computing the similarity between the graph embeddings of the molecule to be pretrained and those of the cluster center molecules, as well as the corresponding fingerprint similarities, we align the graph-embedding similarity with the molecular-fingerprint similarity through loss optimization, using the Mean Squared Error (MSE) loss to compute and minimize the loss value. Through training iterations, we ultimately obtain the graph embedding representation of the molecule.
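The sketch below illustrates one way to realize this objective with PyTorch's MSELoss, treating the fingerprint (Tanimoto) similarities as targets for the graph-embedding scores. Exactly how the two scores are paired (for example, whether the normalized graph distance is first converted into a similarity) is not specified above, so this pairing is an assumption.

```python
import torch.nn as nn

mse = nn.MSELoss()

def pretraining_loss(graph_sim, fingerprint_sim):
    """graph_sim, fingerprint_sim: [B, k] scores against the k cluster center molecules."""
    # The fingerprint similarities act as fixed targets; gradients flow only through
    # the graph-side scores produced by the GNN.
    return mse(graph_sim, fingerprint_sim.detach())
```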
2.2.5 Graph Self-Attention Pool
Attention mechanisms now play a crucial role in artificial intelligence and appear across many neural network modules. In our work, we apply the self-attention mechanism to the graph pooling process. When computing self-attention, the input features are mapped into a query matrix Q, a key matrix K, and a value matrix V through trainable parameters32. In graph pooling, we consider that the graph representation produced at each layer of the graph neural network may influence the final graph representation. Therefore, instead of the conventional choices of taking only the last layer's representation after feature extraction, or simply summing all layers without weights, we use the self-attention mechanism to selectively aggregate the representation from each layer into a comprehensive graph representation, which is subsequently used for fine-tuning on downstream tasks.
During pretraining, we employ a graph neural network. In its readout phase, one typically either uses the feature representation from the last update step for the pooling operation or sums the feature representations from all update steps. Due to the black-box nature of neural networks, it is difficult to determine which layer's representation contributes most to the final result. Therefore, after the aggregation operation at each layer, we introduce self-attention coefficients and leverage the fitting capability of the neural network to find the most suitable graph embedding representation \({h}_{G}\), where \({h}_{0},{h}_{1},\dots ,{h}_{m}\) denote the layer-wise graph representations. The formulas are as follows:
$$\begin{array}{c}{h}_{G}=pool\left({h}_{0},{h}_{1},\dots ,{h}_{m}\right)\#\left(7\right)\end{array}$$
$$\begin{array}{c}pool\left({h}_{0},\dots ,{h}_{m}\right)={\sum }_{l=0}^{m}softmax\left(\frac{{\left(W{h}_{l}\right)}^{T}\left(W{h}_{l}\right)}{\sqrt{{d}_{h}}}\right){h}_{l}\#\left(8\right)\end{array}$$
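As a concrete illustration of Eqs. (7)-(8), the sketch below implements this self-attention readout in PyTorch. It uses the single projection matrix W exactly as written in Eq. (8); module and variable names are illustrative rather than taken from our implementation.

```python
import torch
import torch.nn as nn

class SelfAttentionPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # trainable projection W in Eq. (8)
        self.d_h = dim

    def forward(self, layer_reprs):
        """layer_reprs: list of [batch, dim] tensors, one graph representation per GNN layer."""
        h = torch.stack(layer_reprs, dim=1)                    # [batch, m+1, dim]
        wh = self.W(h)                                         # [batch, m+1, dim]
        scores = (wh * wh).sum(dim=-1) / (self.d_h ** 0.5)     # (W h_l)^T (W h_l) / sqrt(d_h)
        alpha = torch.softmax(scores, dim=1)                   # attention weight per layer
        return (alpha.unsqueeze(-1) * h).sum(dim=1)            # weighted sum -> final graph embedding
```

The softmax is taken over the layer dimension, so each layer's representation receives a learned weight rather than being discarded or summed uniformly.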
2.3 Baseline
To showcase the performance of our designed model, we compare it with several other methods, including those that also undergo downstream task fine-tuning after pre-training. The benchmark models for comparison in this paper are as follows:
MoCL33: Combines two contrastive strategies: a local-level strategy that contrasts representations encoded from two graph augmentations, and a global-level strategy that contrasts the mutual information between similar graphs.
MolCLR18: Employs three molecular graph augmentations, namely atom masking, bond deletion, and subgraph removal, and compares the effect of these different augmentation methods.
GraphCL34: This method generates two L-hop subgraphs for a node with random perturbations and employs self-supervised learning by maximizing the similarity between the two generated subgraphs.
JOAO35: Designs an automatic augmentation-selection framework on top of Graph Contrastive Learning (GraphCL). The overall idea is to iteratively train the various data augmentation methods through adversarial training to obtain a probability matrix, and to replace the projection head in GraphCL accordingly.
PretrainGNN36: Pre-trains the Graph Neural Network (GNN) encoder using context graphs of nodes predicted from the neighborhood structure, together with molecular biomedical measurements.
GCC37: Uses subgraph instance discrimination as the pretext task and performs contrastive learning to capture structural patterns that are universal and transferable.
GPT-GNN38: Introduces a generative pre-training framework that factorizes graph generation into node attribute generation and edge generation, allowing the GNN to capture the structural and semantic properties of graphs.
GROVER39: Integrates message-passing networks into the Transformer architecture and pre-trains a model with 100 million parameters on a dataset of 10 million molecules.
MGSSL40: Introduces a motif-based generative self-supervised learning (SSL) framework that pre-trains a Graph Neural Network (GNN). Motif construction rules are refined, the GNN backbone encodes molecular graph representations, and motifs are predicted in a given order over the graph using either depth-first or breadth-first search.
GraphLoG41: Aligns the embeddings of correlated subgraphs to preserve local similarity and introduces hierarchical prototypes to capture the global semantic structure.
TopExpert42: Learns topology-specific expert models for predicting molecular properties.
HiMol43: Proposes a hierarchical molecular graph neural network that considers both the molecular structure and information from molecular substructures for predicting molecular properties.
2.4 Evaluation Metrics and Experimental Parameters
Because the datasets used in our work are classification datasets, we employ the Area Under the Receiver Operating Characteristic Curve (ROC-AUC) as the evaluation metric for experimental performance; a larger ROC-AUC value indicates better performance.
All experiments were conducted on a Linux server equipped with an Nvidia A100 GPU and an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz. During the pretraining phase, our pretrained model was trained for 100 epochs using the MSELoss and the Adam optimizer with a learning rate of 0.001. The model architecture used a 5-layer GIN with a dropout rate of 0.5. The batch size was set to 512 with 4 threads, and the dimension of the graph embedding was set to 300.
In the downstream tasks, our classifier consists of two fully connected layers and one ELU activation layer. The weights of both the pretrained model and the fine-tuned downstream models were optimized with the Adam optimizer; however, the downstream tasks used a different loss function, namely BCEWithLogitsLoss, and the learning rate was set individually for each dataset. The batch size was kept at 512 with 4 threads, and each downstream dataset was fine-tuned for 200 epochs. Detailed parameter settings for each downstream task dataset are presented in the table.
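For concreteness, the sketch below shows a downstream classification head consistent with this description (two fully connected layers with an ELU activation, trained with BCEWithLogitsLoss and Adam). The hidden width, learning rate, and number of tasks are illustrative placeholders rather than the per-dataset settings.

```python
import torch
import torch.nn as nn

class FineTuneHead(nn.Module):
    def __init__(self, emb_dim=300, hidden_dim=300, num_tasks=1):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, num_tasks),   # one logit per classification task
        )

    def forward(self, graph_emb):
        # graph_emb: [batch, emb_dim] pretrained graph embedding
        return self.classifier(graph_emb)

head = FineTuneHead(num_tasks=12)                           # e.g., 12 Tox21 targets
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)    # learning rate is set per dataset
```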