MoleMCL: a multi-level contrastive learning framework for molecular pre-training

Abstract

Motivation: Molecular representation learning plays an indispensable role in crucial tasks such as property prediction and drug design. Despite the notable achievements of molecular pre-training models, current methods often fail to capture both the structural and the feature semantics of molecular graphs. Moreover, while graph contrastive learning has opened new prospects, existing augmentation techniques often struggle to retain the core semantics of molecules. To overcome these limitations, we propose a gradient-compensated encoder parameter perturbation approach that ensures efficient and stable feature augmentation. By merging augmentation strategies grounded in attribute masking and parameter perturbation, we introduce MoleMCL, a new MOLEcular pre-training model based on Multi-level Contrastive Learning.

Results: Experimental results demonstrate that MoleMCL adeptly dissects the structural and feature semantics of molecular graphs, surpassing current state-of-the-art models in molecular prediction tasks and paving a novel avenue for molecular modeling.

Availability and implementation: The code and data underlying this work are available on GitHub at https://github.com/BioSequenceAnalysis/MoleMCL.

representations of a graph and its substructures of different granularities. In InfoGraph (Sun et al., 2020), the substructures are nodes, edges, and triangles, while in GCC (Qiu et al., 2020) they are sampled subgraphs. A follow-up study, MVGRL (Hassani and Ahmadi, 2020), performs node diffusion to generate an augmented molecular graph and contrasts the atom representations of one view with the molecular representations of the other. The latter class constructs positive samples for contrastive learning by augmenting graph data. Early works such as GRACE (Zhu et al., 2020) and GraphCL (You et al., 2020) introduced random perturbations to the graph structure to construct contrastive views. However, it has been pointed out that for structurally sensitive graph data such as molecules, even minor perturbations can cause significant property alterations. To overcome the manual trial-and-error limitation of GraphCL, JOAO (You et al., 2021) employs a unified bi-level optimization framework to automatically select augmentations for specific graph data; other adaptive augmentation methods include GCA (Zhu et al., 2021) and AD-GCL (Suresh et al., 2021). MoCL (Sun et al., 2021), aiming to preserve the semantics of molecular graphs, incorporates expensive domain knowledge. There are also feature-based approaches: GASSL (Yang et al., 2021) introduces perturbations on the initial features and hidden layers, while SimGRACE (Xia et al., 2022) applies Gaussian noise perturbations to the GNN encoder parameters. However, in our experiments we found that SimGRACE exhibits severe negative transfer on molecular data, indicating that merely adding noise does not guarantee consistent effectiveness. Therefore, building upon SimGRACE, we propose a more principled augmentation method that perturbs parameters using gradient compensation.

2 The algorithmic process of MoleMCL
The pseudocode of MoleMCL is shown in Algorithm 1.

Algorithm 1 The pre-training procedure of MoleMCL
Input: Dataset D = {G_i}_{i=1}^{N}; the attribute masking ratio r_m.
Output: Trained parameters θ.
1: Initialize the graph neural network parameters θ.
2: for each batch B in D do
3:    Generate masked graphs G_m.
4:    Compute the feature representations z_G and z_{G_m} for G and G_m.
5:    Compute L_Mask and L_CL1 to obtain L_MaskGCL.
6:    Perturb the parameters of the GNN encoder using gradients from L_CL1.
7:    Generate new positive samples z'_G using the perturbed GNN encoder.
8:    Compute the contrastive loss L_PPGCL between z_G and z'_G.
9:    Update the parameters θ based on the combined loss L = L_MaskGCL + L_PPGCL.
10: end for
11: return trained parameters θ.
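To make the loop concrete, below is a minimal PyTorch-style sketch of Algorithm 1. The helper names mask_attributes, mask_loss, and perturb_encoder are hypothetical stand-ins for the components described above (attribute masking, the node-level reconstruction loss, and the gradient-compensated perturbation sketched in Section 3 below), and the linear weighting of the two MaskGCL losses is our assumption; treat this as an illustration, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    """NT-Xent contrastive loss; matched rows of z1 and z2 are positives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                      # pairwise similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def pretrain_molemcl(gnn, loader, optimizer, r_m=0.25, alpha=0.5, mu=6.0):
    """One epoch of the MoleMCL pre-training loop (sketch of Algorithm 1)."""
    for batch in loader:                            # 2: for each batch B in D
        g_m = mask_attributes(batch, r_m)           # 3: masked graphs G_m (placeholder)
        z_g, z_gm = gnn(batch), gnn(g_m)            # 4: z_G and z_{G_m}

        l_mask = mask_loss(gnn, g_m, batch)         # node-level reconstruction (placeholder)
        l_cl1 = nt_xent(z_g, z_gm)                  # graph-level contrast
        # 5: L_MaskGCL; the convex combination is our assumption
        l_maskgcl = (1 - alpha) * l_mask + alpha * l_cl1

        # 6: perturb a copy of the encoder using gradients from L_CL1
        grads = torch.autograd.grad(l_cl1, list(gnn.parameters()),
                                    retain_graph=True)
        gnn_pert = perturb_encoder(gnn, grads, mu)

        with torch.no_grad():
            z_g_prime = gnn_pert(batch)             # 7: new positives z'_G
        l_ppgcl = nt_xent(z_g, z_g_prime)           # 8: L_PPGCL

        loss = l_maskgcl + l_ppgcl                  # 9: combined loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```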
3 The Influence of the hyperparameters
We conducted systematic performance experiments on two hyperparameters: the trade-off hyperparameter α and the gradient weight µ. α adjusts the relative weighting of the node-level reconstruction task and the graph-level contrastive learning task in the MaskGCL module. As Table 1 shows, when α increases from 0.3 to 0.5, the performance of MoleMCL improves significantly; as α further increases to 0.7, performance declines. Selecting an appropriate α is therefore crucial and requires balancing the two tasks.
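For reference, if the MaskGCL objective combines the two losses as a convex combination (an assumption on our part; the text states only that α balances the two tasks), it reads

```latex
% alpha trades the node-level reconstruction loss off against the
% graph-level contrastive loss (the convex combination is assumed).
\mathcal{L}_{\mathrm{MaskGCL}} = (1-\alpha)\,\mathcal{L}_{\mathrm{Mask}} + \alpha\,\mathcal{L}_{\mathrm{CL1}}
```

so that α = 0.5 weights the two tasks equally, matching the best-performing setting in Table 1.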
µ controls the magnitude of parameter perturbation: a larger µ corresponds to a greater influence of the gradients on the perturbation. As Table 2 shows, the performance of MoleMCL trends upward as µ increases from 2 to 6, then declines slightly after µ reaches 8. This suggests that a moderate µ enhances model performance, but beyond a certain threshold, further increasing µ degrades it.
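The sketch below shows one plausible form of the gradient-compensated perturbation: SimGRACE-style Gaussian noise scaled by each weight tensor's standard deviation, plus a compensation term built from the gradients of L_CL1 and weighted by µ. The exact update rule in MoleMCL may differ; this only illustrates where µ enters.

```python
import copy
import torch

def perturb_encoder(gnn, grads, mu, eta=1.0):
    """Return a perturbed copy of the encoder (hedged sketch).

    Each parameter receives Gaussian noise scaled by its own standard
    deviation, as in SimGRACE, plus a gradient term weighted by mu;
    the exact compensation rule of MoleMCL may differ.
    """
    gnn_pert = copy.deepcopy(gnn)
    with torch.no_grad():
        for p, g in zip(gnn_pert.parameters(), grads):
            # std() is undefined for single-element tensors (e.g. scalar biases)
            scale = p.std() if p.numel() > 1 else p.new_tensor(0.0)
            noise = torch.randn_like(p) * scale     # SimGRACE-style noise
            p.add_(eta * (noise + mu * g))          # gradient compensation
    return gnn_pert
```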

4 More Results of Molecule Retrieval
We present additional molecular retrieval results in Figure 1, where MaskGCL+SimGRACE denotes training with a combination of the two contrastive losses. Notably, the retrieval results of MaskGCL and MaskGCL+SimGRACE are not entirely satisfactory: although MaskGCL+SimGRACE identifies more similar molecules than MaskGCL, it fails to rank molecules in a chemically meaningful order. Hence, while combining methods at the feature level is crucial, appropriate augmentation strategies still require careful design. Our proposed PPGCL, combined with MaskGCL, achieves accurate molecular retrieval: it not only identifies the molecules with the highest similarity but also arranges them in an order that aligns with their practical chemical significance.
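As a usage illustration, retrieval along the lines of Figure 1 can be reproduced by embedding the corpus with the trained encoder, ranking candidates by similarity to the query embedding, and reporting Tanimoto similarity on Morgan fingerprints for the chemical comparison. The embed wrapper around the encoder and the cosine ranking are our assumptions; only the RDKit fingerprint and Tanimoto calls are standard.

```python
import torch
import torch.nn.functional as F
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def retrieve(query_smiles, corpus_smiles, embed, k=4):
    """Return the k corpus molecules closest to the query in embedding space.

    `embed` is a hypothetical wrapper mapping a SMILES string to the
    trained encoder's graph-level representation (a 1-D tensor).
    """
    z_q = F.normalize(embed(query_smiles), dim=-1)
    z_c = F.normalize(torch.stack([embed(s) for s in corpus_smiles]), dim=-1)
    top = torch.topk(z_c @ z_q, k).indices.tolist()  # cosine-similarity ranking

    # Tanimoto similarity on Morgan fingerprints, as displayed in Figure 1
    fp_q = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(query_smiles), radius=2, nBits=2048)
    results = []
    for i in top:
        fp_i = AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(corpus_smiles[i]), radius=2, nBits=2048)
        results.append((corpus_smiles[i],
                        DataStructs.TanimotoSimilarity(fp_q, fp_i)))
    return results
```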

Figure 1: The query molecule alongside the four closest molecules retrieved with the extracted representations. The Tanimoto similarity scores, displayed below each molecule, quantify the chemical resemblance to the query molecule.

Table 1: Performance of MoleMCL with various values of α on eight MoleculeNet datasets.

Table 2: Performance of MoleMCL with various values of the gradient weight µ on eight MoleculeNet datasets.

Table 3: Training times of pre-trained models.

Table 3 compares the training times of four pre-trained models, all measured on an NVIDIA A100 80 GB GPU. JOAO and JOAOv2 are based on an adversarial training framework that automatically selects data augmentations during training; their longer training times of 32 and 30 hours, respectively, reflect the higher computational demands of this complex framework. Mole-BERT requires a two-step training process, first generating a vocabulary through a VQ-VAE and then conducting molecular representation pre-training. In contrast, our model, MoleMCL, directly explores molecules through multi-level contrastive learning and stands out with the shortest training time of 21 hours, showcasing its efficiency in learning molecular representations.