Knowledge graph-enhanced molecular contrastive learning with functional prompt

Deep learning models can accurately predict molecular properties and help make the search for potential drug candidates faster and more efficient. Many existing methods are purely data driven, focusing on exploiting the intrinsic topology and construction rules of molecules without any chemical prior information. The high data dependency makes them difficult to generalize to a wider chemical space and leads to a lack of interpretability of predictions. Here, to address this issue, we introduce a chemical element-oriented knowledge graph to summarize the basic knowledge of elements and their closely related functional groups. We further propose a method for knowledge graph-enhanced molecular contrastive learning with functional prompt (KANO), exploiting external fundamental domain knowledge in both pre-training and fine-tuning. Specifically, with the element-oriented knowledge graph as a prior, we first design an element-guided graph augmentation in contrastive-based pre-training to explore microscopic atomic associations without violating molecular semantics. Then, we learn functional prompts in fine-tuning to evoke the downstream task-related knowledge acquired by the pre-trained model. Extensive experiments show that KANO outperforms state-of-the-art baselines on 14 molecular property prediction datasets and provides chemically sound explanations for its predictions. This work contributes to more efficient drug design by offering a high-quality knowledge prior, interpretable molecular representations and superior prediction performance.

Deep learning can be used to predict molecular properties, but such methods usually need a large amount of data and are hard to generalize to different chemical spaces. To provide a useful primer for deep learning models, Fang and colleagues use contrastive learning and a knowledge graph based on the Periodic Table and Wikipedia pages on chemical functional groups.

Molecular property prediction is widely considered one of the most important tasks in drug discovery. Traditional wet-lab experiments are time consuming and require a huge and incessant investment 1,2 . With artificial intelligence, researchers have studied molecular property prediction models to assess the clinical trial success rate and therapeutic potential of drug candidates, or even directly predict whether a compound will receive US Food and Drug Administration approval, substantially speeding up drug development and avoiding costly late-stage failures.
With the increasing availability of chemical experimental data, researchers have adopted pre-training models on extensive collections of unlabelled molecules, followed by fine-tuning on a limited number of labelled molecules for a specific task [3][4][5][6]. Most of these self-supervised learning (SSL) methods on molecules are purely data driven, focusing on exploiting the intrinsic information of molecular graphs without any prior chemical knowledge [7][8][9][10].

Figure 2a shows a snapshot of ElementKG, which consists of two levels, an instance level and a class level, coloured red and blue, respectively. At the instance level, chemical elements and functional groups are represented as entities in ElementKG, denoted by red blocks. To record the various chemical attributes of each element (for example, electron affinity and boiling point) and the composition of each functional group (for example, bond type), we apply data properties that attach literal data type values to an entity. The dotted block represents the data properties of the entity in the red block above it. Furthermore, as indicated by the red arrows, we establish associations between entities through object properties, such as chemical attribute relations between elements and the inclusion relations between elements and functional groups. We then classify all entities on the basis of their commonalities, resulting in the class level of ElementKG. Entities are assigned to the corresponding classes via rdf:type, denoted by dashed black arrows. The blue blocks represent different classes, while the blue arrows reflect the inclusion (rdfs:subClassOf) or disjointness (owl:disjointWith) relations between them. In particular, the subClassOf relations between classes form the class hierarchy, which serves as the backbone of ElementKG. The construction details can be found in Methods, and the statistics of ElementKG are displayed in Supplementary Information.
To comprehensively explore the structural and semantic information and obtain meaningful representations of all entities, relations and other components in ElementKG, we adopt a KG embedding approach based on OWL2Vec* (ref. 21). For further elaboration, please see Methods.
Contrastive-based pre-training. After obtaining ElementKG and its embeddings, we aim to incorporate it into pre-training to enhance the model's understanding of fundamental domain knowledge. We employ a contrastive learning method to pre-train a graph encoder on a large set of unlabelled molecules, using the basic element knowledge in ElementKG.

Moreover, with the enormous chemical space, purely data-driven models rely heavily on pre-training datasets and may not generalize well to different downstream prediction tasks. Additionally, models that capture only the topology of molecular graphs and simple construction rules generally yield low interpretability. Therefore, it is important to leverage fundamental chemical knowledge as a prior to guide the model to explore the chemical semantics of molecules at the microscopic level and to discover meaningful patterns in both pre-training and fine-tuning.
As a typical SSL method, contrastive learning has attracted increasing research interest. To construct similar pairs and maximize the agreement between them, existing methods rely on universal graph augmentation techniques such as node deletion, edge perturbation and subgraph extraction 11. However, these techniques can be unsuitable for molecular graphs, because adding or removing chemical bonds or atoms can substantially alter a molecule's properties and identity 12. Moreover, most existing methods consider only the connections between atoms established by chemical bonds, and thus do not fully explore the underlying relations between atoms in a molecular graph, which highlights the need to incorporate external domain knowledge.
Another neglected issue is that the pre-training tasks differ greatly from the downstream tasks, so directly applying pre-trained representations to downstream tasks may result in suboptimal performance. In this Article, to address this, we propose providing a chemical prompt during fine-tuning, based on fundamental chemical knowledge, to bridge this gap. Inspired by prompt-tuning 13, an emerging paradigm that has demonstrated remarkable performance on a wide range of natural language processing tasks [14][15][16][17], we devise appropriate prompts for molecular graphs based on fundamental chemical knowledge to enable more reliable predictions.
To this end, we propose a chemical element-oriented knowledge graph (ElementKG), which integrates basic knowledge of elements and functional groups in an organized and standardized manner. Then we exploit the contained fundamental chemical knowledge as a prior in both pre-training and fine-tuning, and propose a novel knowledge graph-enhanced molecular contrastive learning with functional prompt (KANO).
First, we construct a chemical ElementKG based on the Periodic Table (https://ptable.com) and Wikipedia pages (https://en.wikipedia.org/wiki/Functional_group). ElementKG offers a comprehensive and standardized view from a chemical element perspective, which forms the foundation of our work. It covers the class hierarchy of elements, the chemical attributes of elements, the relationships between elements, the corresponding functional groups, and the connections between functional groups and their constituent elements.
Second, we introduce an element-guided graph augmentation in contrastive pre-training. Specifically, we augment the original molecular graph under the guidance of element knowledge in ElementKG, extracting rich relations between elements and associations between atoms that share the same element type but are not directly connected by chemical bonds. The resulting augmented graph respects the chemical semantics within molecules and establishes essential connections between atoms that go beyond the structural information. On top of this, a contrastive learning framework is developed to avoid indiscriminate implantation of external knowledge and to mitigate injection noise by allowing the two graph views to complement each other.
Third, we propose functional prompts to bridge the gap between pre-training contrastive tasks and downstream molecular property prediction tasks. As sets of atoms bonded together in a specific pattern, functional groups play a crucial role in determining the properties of the parent molecule 18 and are therefore closely related to downstream tasks. Accordingly, in fine-tuning, we utilize the functional group knowledge in ElementKG as prompts to evoke the downstream task-related knowledge acquired by the pre-trained model.

Traditional graph augmentation techniques for creating positive pairs in contrastive learning often involve dropping nodes or perturbing edges, which can violate the chemical semantics within molecules. To address this issue and establish more meaningful connections between atoms, we propose an element-guided graph augmentation approach for constructing positive pairs.
As shown in Fig. 1b, we begin by identifying the element types present in a given molecule (for example, C, N and O) and retrieving their corresponding entities and relations from ElementKG (for example, (N, hasStateGas, O) and (O, inPeriod2, C)). This forms an element relation subgraph that describes the relationships between elements using their associated entities and relations. We link the element entity nodes in this subgraph to their corresponding atom nodes in the original molecular graph to create an augmented molecular graph that integrates fundamental domain knowledge and captures the essential associations between atoms that share the same element type, even if they are not directly connected by chemical bonds. Our approach preserves the topological structure while incorporating important chemical semantics. Additional details about the input features and the triple definition can be found in Supplementary Information.
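The augmentation step described above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical data structures (plain lists and tuples standing in for molecular graphs and ElementKG triples), not the authors' implementation:

```python
# Minimal sketch of element-guided graph augmentation. The data structures
# (symbol lists, (i, j) bond pairs, (head, relation, tail) triples) are
# hypothetical stand-ins for the paper's molecular graphs and ElementKG.

def augment(atoms, bonds, kg_triples):
    """Attach an element relation subgraph to a molecular graph.

    atoms: element symbol per atom index, e.g. ['C', 'N', 'O'].
    bonds: (i, j) atom-index pairs for chemical bonds.
    kg_triples: (head, relation, tail) element triples from the KG.
    """
    present = set(atoms)
    # Element relation subgraph: keep triples whose endpoints occur in the molecule.
    sub = [(h, r, t) for (h, r, t) in kg_triples if h in present and t in present]
    nodes = [('atom', i, sym) for i, sym in enumerate(atoms)]
    nodes += [('element', sym) for sym in sorted(present)]
    edges = [('bond', i, j) for i, j in bonds]
    edges += [('kg', h, r, t) for (h, r, t) in sub]
    # Link each element entity to all atoms of that type, connecting atoms that
    # share an element type even when no chemical bond joins them.
    edges += [('type_link', sym, i) for i, sym in enumerate(atoms)]
    return nodes, edges
```

Two atoms of the same element that share no bond become connected through their shared element entity node, which is the "essential association beyond the structural information" described above.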
On top of this, we employ a contrastive learning framework to train the graph encoder by maximizing the consistency between the original molecular graph and the augmented molecular graph, without indiscriminately embedding element knowledge in the augmented graph.

Fig. 1 | b, Contrastive-based pre-training. We use an element-guided graph augmentation strategy based on the element knowledge of ElementKG to convert the original molecular graph G into the augmented molecular graph G′, establishing essential connections between atoms beyond the inherent structure. The graph encoders are then trained to maximize the agreement between these two graph views to avoid excessive knowledge injection in G′. c, Prompt-enhanced fine-tuning. We leverage the functional group knowledge of ElementKG to generate a corresponding functional prompt for each molecule, stimulating the pre-trained graph encoder to recall the learned molecular property-related knowledge and bridging the gap between the pre-training contrastive tasks and the downstream tasks. The resulting prompt-enhanced molecular graph is then fed into the pre-trained graph encoder for molecular property prediction.
Article https://doi.org/10.1038/s42256-023-00654-0

Given a minibatch of N randomly sampled molecules, we create a set of 2N graphs by transforming their molecular graphs using element-guided graph augmentation. Following refs. 12,22, we treat the 2(N − 1) graphs other than the positive pair within the same minibatch as negatives, where a positive pair consists of the original molecular graph G_i and its augmented molecular graph G′_i. We apply a graph encoder f(⋅) to extract graph embeddings from the two graph views, and a non-linear projection network g(⋅) to map these embeddings into a space where the contrastive loss is applied, resulting in two new representations z_{G_i} and z_{G′_i}. Finally, a contrastive loss is used to maximize the consistency between positive pairs while minimizing the agreement between negative pairs. For further details, refer to Methods.
Prompt-enhanced fine-tuning. After pre-training, the molecular graph encoder needs to be fine-tuned for downstream property prediction. Specifically, the input molecular graph G is fed into the pre-trained graph encoder f(⋅) to extract the graph embedding h G , which is then fed into the predictor to output the property value. To bridge the gap between the pre-training contrastive tasks and downstream tasks, we propose to use functional group knowledge as prompts to stimulate the pre-trained graph encoder.
As shown in Fig. 1c, we generate the functional prompt from the functional group knowledge of ElementKG. First, we detect all functional groups in the input molecule, retrieve their corresponding entity embeddings in ElementKG and construct a mediator with a learnable embedding to capture the importance of each functional group. We then apply a self-attention mechanism to the embedding of the mediator (coloured in red) and the embeddings of the functional group entities to comprehensively aggregate their semantics and obtain the functional prompt. Finally, the functional prompt is added to the original representation of each atom node in the input molecular graph with a learnable scale parameter to produce the prompt-enhanced molecular graph, which is then fed into the pre-trained graph encoder and a predictor for molecular property prediction. The technical details of functional prompts are provided in Methods.

KANO boosts the performance of property prediction
Molecular properties of interest can vary widely in scale, ranging from macroscopic influences on the human body to microscopic electronic properties, such as drug side-effects 23 , the ability to inhibit human immunodeficiency virus (HIV) replication 24 and hydration free energy 3 . To assess the effectiveness of KANO, we evaluated its performance on datasets in four categories: physiology, biophysics, physical chemistry and quantum mechanics. For more information on the datasets and baselines, please refer to Supplementary Information. Tables 1 and 2 present the results of various supervised and SSL methods. #Molecules represents the number of molecules in each dataset, and #Tasks indicates the number of binary prediction tasks in each dataset.

Fig. 2 | Illustration of ElementKG and its embedding process. a,
A snapshot of ElementKG. ElementKG contains the class hierarchy, data properties, object properties and entities of both elements and functional groups. b, The process of ElementKG embedding. We derive a corpus of three documents (structure document, lexical document and combined document) from ElementKG, considering the structural topology, literal semantics and correspondence between entity IDs and literal words in ElementKG, respectively. We then train a language model to learn entity and relation embeddings from this corpus. This process enables the integration of element and functional group knowledge into a unified representation, which facilitates downstream molecular property prediction.
Table 1 reports the test receiver operating characteristic-area under the curve (ROC-AUC, %) on classification tasks in physiology and biophysics. Key observations include: (1) KANO consistently outperforms other methods on all eight datasets, with a significant improvement of 3.79%, showcasing its effectiveness. (2) KANO performs well on multi-task learning datasets such as Tox21, ToxCast, SIDER and MUV. In particular, KANO achieves a 3.39% improvement on the ToxCast dataset, which comprises 617 binary classification tasks. This robust performance indicates that its representations cover diverse molecular semantics. Table 2 presents the test performance on regression tasks in physical chemistry and quantum mechanics. In summary, KANO outperforms other models on all benchmarks, demonstrating the effectiveness of integrating ElementKG into the pre-training and fine-tuning stages. KANO not only outperforms other SSL methods but also demonstrates superiority over supervised methods, providing a competitive advantage for generalization to a broader chemical space.

Richer knowledge in KG leads to more robust representations
ElementKG is essential in the KANO framework as it guides molecular augmentation and functional prompt generation. To determine the contributions of its various components, we evaluate KANO's performance when different KG components, such as the class hierarchy, data properties and functional group knowledge, are pruned. We prune ElementKG's components only during pre-training and keep the experimental settings for fine-tuning consistent with the original KANO approach.
Extended Data Fig. 1a reports the test performance when individual KG components are pruned. To further investigate the impact of data properties, of which each element has more than 15, we mask a certain proportion of them and report the test performance on the four categories of tasks. Extended Data Fig. 1b shows the test results for varying keep rates of data properties. Notably, the model's performance consistently improves as the proportion of retained properties increases, verifying that richer data properties provide more comprehensive fundamental knowledge and consequently enable the learning of more robust molecular representations.

Contrastive learning produces a high-quality feature space
The quality of a representation space can be evaluated by two key properties: alignment and uniformity 25. The former indicates that similar samples should be mapped to nearby embeddings, while the latter suggests that feature vectors should be uniformly distributed on the unit hypersphere, preserving as much data information as possible. In Fig. 3, we compare the molecular representations produced by our method with those obtained by other methods, including a supervised model (CMPNN 26), a representative predictive method (GROVER 8) and a contrastive method with a universal augmentation strategy (MolCLR CMPNN 11).
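Alignment and uniformity can be quantified with the metrics proposed in ref. 25: the mean distance between positive-pair embeddings, and the log of the mean pairwise Gaussian potential. A minimal NumPy sketch, independent of the authors' code:

```python
import numpy as np

def alignment(x, y, alpha=2):
    # Mean distance between positive-pair embeddings; lower means better aligned.
    return float(np.mean(np.linalg.norm(x - y, axis=1) ** alpha))

def uniformity(x, t=2):
    # Log of the mean pairwise Gaussian potential; lower means more uniform.
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(x), k=1)
    return float(np.log(np.mean(np.exp(-t * sq[iu]))))
```

Embeddings spread evenly over the unit circle score lower (better) on uniformity than embeddings collapsed to a single point, matching the intuition described above.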

Alignment analysis.
We visualize the representations of molecules with different scaffolds by t-distributed stochastic neighbour embedding (t-SNE) 27 to test whether molecules with the same scaffold have similar representations. The scaffold, which represents the core structure of a molecule, is a fundamental concept in chemistry and provides a basis for systematic investigations of molecular cores and building blocks 28. Molecules with different scaffolds typically have very different chemical properties. We choose the seven most common scaffolds from each dataset (Tox21, QM7 and BBBP) and distinguish the scaffolds with different colours. As shown in Fig. 3a, the model without pre-training cannot distinguish molecules with these scaffolds, and the predictive and contrastive methods show only slight improvement. In contrast, KANO produces more distinctive clusters with the lowest Davies-Bouldin (DB) index.

Uniformity analysis.
To examine the uniformity of the learned molecular representations, we first map them onto the unit hypersphere S¹ using t-SNE 27, and then visualize the density distributions of the representations on S¹ using non-parametric Gaussian kernel density estimation (KDE) 29 in ℝ². We also show the density estimations of angles for each point on S¹ to present the results more clearly. Figure 3b illustrates the feature and density distributions of the molecular representations learned by our model and the three baselines on the Tox21, ToxCast and ClinTox datasets. In the first three columns, the distributions of the representations are relatively highly clustered, with sharp density distributions. In the last column, the distribution becomes more uniform and the density estimation curves are markedly less sharp. From Fig. 3, we observe that our model maps molecules with the same scaffold to similar representations, and the pre-trained representations have a more uniform distribution than the baselines.
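The Davies-Bouldin index used in the alignment analysis above has a simple closed form: the mean, over clusters, of the worst ratio of summed within-cluster scatter to between-centroid distance. A small NumPy sketch, not tied to the authors' evaluation code:

```python
import numpy as np

def davies_bouldin(points, labels):
    # points: (n, d) embeddings; labels: integer cluster id per point.
    ids = np.unique(labels)
    cents = np.array([points[labels == c].mean(axis=0) for c in ids])
    # Within-cluster scatter: mean distance of members to their centroid.
    scat = np.array([np.mean(np.linalg.norm(points[labels == c] - cents[k], axis=1))
                     for k, c in enumerate(ids)])
    k = len(ids)
    worst = []
    for i in range(k):
        ratios = [(scat[i] + scat[j]) / np.linalg.norm(cents[i] - cents[j])
                  for j in range(k) if j != i]
        worst.append(max(ratios))
    return float(np.mean(worst))
```

Tight, well-separated scaffold clusters yield a low DB index, which is why a lower index indicates more distinctive clusters.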
Our ElementKG and KG-guided contrastive learning framework enable KANO to capture globally intrinsic molecular characteristics by normalizing the filtering of knowledge and perceiving global structural insights. Supplementary Information provides additional visualizations of KANO pre-trained representations.

Functional prompts enable explainable predictions
In Extended Data Fig. 2, we compared KANO's performance with functional prompts to that without prompts, and evaluated two alternative architectures that integrate functional group knowledge by adding or concatenating it to each atom representation. The results show that the model with functional prompts performs better than the one without, with an 8.41% relative improvement. Furthermore, adding and concatenating functional group features proved to be suboptimal choices, emphasizing the effectiveness of functional prompts.
Since functional prompts act as a bridge between pre-training contrastive tasks and downstream molecular property prediction tasks, we are interested in their potential to provide domain-specific interpretability. We visualize the attention weights of functional groups in molecular graphs from four property categories in Fig. 4. (1) The first example is from the Tox21 (ref. 30) public database, which measures the toxicity of compounds. We observe higher attention weights for pyridyl and azo functional groups, followed closely by primary amine. Interestingly, pyridyl and primary amine groups can combine to form 2,6-diaminopyridine, a major component of secondary hepatotoxins and skin sensitizers 31 . Azo-containing compounds, such as azo dyes, exhibit carcinogenic and mutagenic properties, making them highly significant 32 . (2) The second example is a human β-secretase 1 (BACE-1) inhibitor from the BACE dataset 33 . The molecule assigns more attention to amidine, carboxamide and secondary ketimine, which form the imidazole component. In addition, pyridyl and phenyl also receive more attention. These findings align with previous research 34,35 , suggesting that the aromatic heterocycle family inhibits BACE-1. (3) The third sample is from FreeSolv 36 , which focuses on the hydration free energy of small molecules in water. Fluoro and hydroxyl groups receive higher attention due to fluoro's strong electron-acquiring ability and hydroxyl's hydrophilicity, affecting the molecule's interaction force with water. Additionally, carboxyl groups with strong polarity receive more attention weights. (4) The final molecule is from QM7 (ref. 37), recording the atomization energies of molecules. Alkenyl and carboxamide groups receive more attention due to the higher bond energy of the carbon-carbon double bond and the stability of the amide bond, requiring more energy to break them apart into separate atoms. 
The interpretability exploration illustrates how functional prompts bridge the gap between pre-training tasks and downstream tasks by evoking the downstream task-related knowledge acquired by the pre-trained model.

Conclusion
In this study, we presented KANO, a novel approach that enhances molecular property prediction tasks by incorporating chemical domain knowledge. KANO achieved superior performance on 14 molecular benchmarks by leveraging ElementKG, a KG that organizes the knowledge of elements and functional groups. KG-guided pre-training allowed KANO to obtain a high-quality molecular representation space, while functional prompts captured meaningful chemical substructures relevant to downstream tasks. While KANO has shown promising performance, it may still have some limitations. For instance, ElementKG may not fully capture molecular system complexity, and the current functional prompts may not be able to capture long-range interactions between substructures. To address these limitations, we suggest several interesting future directions. Firstly, extending ElementKG to cover other areas of chemistry and integrating it with other existing KGs could provide a more comprehensive understanding of molecular systems. Secondly, studying the interpretability of KANO's learned representations and the chemical knowledge captured by the functional prompts could provide insights for molecular design and optimization. Finally, exploring the possibility of combining KANO with other techniques to improve its performance on small datasets and accelerate drug discovery could be a promising direction to pursue.

ElementKG construction and representation
We constructed ElementKG by integrating knowledge from the Periodic Table and Wikipedia pages, providing a holistic view of the element class hierarchy, the chemical attributes of elements and functional groups, and the relations between them. The detailed construction process is shown in Fig. 2 and described below.
First, we extracted the class hierarchy from the collected knowledge of elements and functional groups, which serves as the backbone of ElementKG. As shown in the upper part of Fig. 2, blue blocks represent different classes and blue arrows reflect the containment or disjointness relations between them. For example, the rdfs:subClassOf construct between the class ReactiveNonmetals and the class Nonmetals means that the set of entities in ReactiveNonmetals is a subset of the entities in Nonmetals. Likewise, every entity in the Ester class is a member of its parent class, GroupContainingOxygen. It is important to note that subclass relations are transitive, implying that the ReactiveNonmetals class is also a subclass of the Element class. However, since literal names can be insufficient to differentiate between classes, we defined disjointness for the classes and added disjointness axioms using owl:disjointWith. For example, the disjointness between the Metals and Nonmetals classes indicates that an element entity in the Metals class cannot simultaneously be a member of the Nonmetals class. Using the class hierarchy, we assigned the corresponding entities to each class via rdf:type; for example, both the C and O elements in the red blocks are members of the ReactiveNonmetals class.
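The class-level semantics described above (transitive rdfs:subClassOf and owl:disjointWith) can be mimicked in a toy Python fragment. The class names follow the examples in the text, while the dict/set encoding is an illustrative stand-in for real OWL axioms:

```python
# Toy stand-in for ElementKG's class level. Class names follow the examples in
# the text; the data structures are illustrative, not real OWL.
SUBCLASS = {                      # direct rdfs:subClassOf edges
    'ReactiveNonmetals': 'Nonmetals',
    'Nonmetals': 'Element',
    'Metals': 'Element',
    'Ester': 'GroupContainingOxygen',
}
DISJOINT = {frozenset({'Metals', 'Nonmetals'})}              # owl:disjointWith
TYPE = {'C': 'ReactiveNonmetals', 'O': 'ReactiveNonmetals'}  # rdf:type

def superclasses(cls):
    # rdfs:subClassOf is transitive: walk up to the root of the hierarchy.
    out = []
    while cls in SUBCLASS:
        cls = SUBCLASS[cls]
        out.append(cls)
    return out

def may_also_belong_to(entity, other_class):
    # An entity cannot be a member of two disjoint classes at the same time.
    mine = {TYPE[entity], *superclasses(TYPE[entity])}
    other = {other_class, *superclasses(other_class)}
    return all(frozenset({a, b}) not in DISJOINT for a in mine for b in other)
```

Walking the hierarchy reproduces the transitivity noted above (ReactiveNonmetals is also a subclass of Element), and the disjointness check rules out, for example, a reactive nonmetal also being a metal.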
Second, we compile a list of chemical attributes sourced from the Periodic Table and assign them as data properties to each entity in ElementKG (the dotted blocks). More than 15 data properties, including hasName, hasAtomic, hasDensity and hasIonization, are associated with each element. For functional groups, on the other hand, we record the types of bonds they contain via hasBondType. For instance, Carboxyl contains single and double bonds, while Phenyl contains both single and aromatic bonds.
Third, we use object properties (red directional arrows) to model the relationships between entities in ElementKG. To achieve this, we discretize the continuous chemical attribute values of elements and use them as object properties (for example, inRadiusGroup1 and inWeightGroup2) to connect element entities to each other. For instance, the triple (C, inRadiusGroup1, O) indicates that the entities C and O are both in Radius Group 1, while (C, hasStateGas, O) means that they are both in the gaseous state. We add symmetric characteristics to these object properties, which means that (O, hasStateGas, C) also holds when given (C, hasStateGas, O). Since ElementKG is primarily element oriented, we do not directly add object properties to functional groups. Instead, we establish the connection between element and functional group entities through the isPartOf object property, which indicates that the element is involved in the formation of the functional group.
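The symmetric characteristic of these object properties amounts to closing the triple set under head/tail swapping. A minimal sketch, where the property names follow the examples above and the plain-set triple store is illustrative:

```python
# Symmetric object properties: materialize the closure under head/tail swap.
SYMMETRIC = {'hasStateGas', 'inRadiusGroup1'}

def symmetric_closure(triples):
    closed = set(triples)
    for h, r, t in triples:
        if r in SYMMETRIC:
            closed.add((t, r, h))   # (C, hasStateGas, O) entails (O, hasStateGas, C)
    return closed
```

Note that isPartOf is deliberately excluded: an element being part of a functional group does not entail the reverse.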
To fully explore the structural and semantic information and obtain meaningful representations of all entities, relations and other components in ElementKG, we employ a KG embedding approach based on OWL2Vec* (ref. 21). As illustrated in Fig. 2b, this approach involves two steps: (1) extracting a corpus from ElementKG, including a structure document, a lexical document and a combined document, and (2) training a language model on the corpus to obtain high-quality KG embeddings 38. The structure document captures the graph structure and the logical constructors by computing random walks for each target entity and combining the traversed relations and entities into sentences. For example, a random walk of depth 3 starting from the element C would result in the sentence (C, inRadiusGroup1, O, rdf:type, ReactiveNonmetals). The lexical document includes sentences parsed from the structure document. For example, the sentence above can be parsed as ('C', 'in', 'radius', 'group1', 'O', 'type', 'reactive', 'nonmetals'). To establish the correspondence between entities and their literal names, we replace each word in the lexical document with the corresponding entity in the structure document, resulting in a combined document. That is, the example above can be converted to a set of sentences: (C, 'in', 'radius', 'group1', 'O', 'type', 'reactive', 'nonmetals'), ('C', inRadiusGroup1, 'O', 'type', 'reactive', 'nonmetals') and so on. These three documents are merged into a single document, which is then used to train a word2vec 39 model with the skip-gram architecture. Finally, we obtain embeddings for each entity and relation in ElementKG, which we use for input feature initialization of the augmented molecular graph and functional prompt generation.
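The corpus-extraction step can be approximated in pure Python: random walks yield structure-document sentences, camel-case splitting yields the lexical document, and keeping one token as an entity ID at a time yields the combined document. This is a simplified sketch over a two-edge toy graph, not OWL2Vec* itself:

```python
import random
import re

# Two-edge toy graph; entity and relation IDs follow the paper's examples.
EDGES = {
    'C': [('inRadiusGroup1', 'O')],
    'O': [('rdf:type', 'ReactiveNonmetals')],
}

def walk(start, hops, rng):
    # Structure document: a random walk becomes one entity/relation sentence.
    sent, node = [start], start
    for _ in range(hops):
        if node not in EDGES:
            break
        rel, node = rng.choice(EDGES[node])
        sent += [rel, node]
    return sent

def lexicalize(token):
    # Lexical document: split IDs such as 'inRadiusGroup1' into lowercase words.
    words = re.findall(r'[A-Za-z][a-z]*\d*|\d+', token.split(':')[-1])
    return [w.lower() for w in words]

def combined(sent):
    # Combined document: keep one token as its ID, lexicalize all the others.
    out = []
    for i in range(len(sent)):
        row = []
        for j, tok in enumerate(sent):
            row.extend([tok] if i == j else lexicalize(tok))
        out.append(row)
    return out
```

The merged documents would then be fed to a skip-gram word2vec model (for example, via gensim) to learn the entity and relation embeddings.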

Contrastive learning framework
We employ a contrastive learning framework to learn the representations of molecular graphs. Given a minibatch of size N, we generate 2N graphs by transforming the N original molecular graphs into N augmented molecular graphs. The original molecular graph G_i and its augmented version G′_i constitute a positive pair (G_i, G′_i), while (G_i, G_j)_{j≠i} and (G_i, G′_j)_{j≠i} form negative pairs.
After capturing the graph representations using the graph encoders f(⋅), a non-linear transformation g(⋅) called the projection network maps both the original and augmented graph representations to a latent space where the contrastive loss is calculated, as proposed in SimCLR 40. We adopt a two-layer perceptron (MLP) to perform the projection. Then, we use the normalized temperature-scaled cross-entropy (NT-Xent) loss function 40 to train the graph encoders to maximize the agreement between positive pairs and the discrepancy between negative pairs. Let sim(z_1, z_2) = z_1^⊤ z_2 / (‖z_1‖ ⋅ ‖z_2‖) denote the cosine similarity between ℓ2-normalized z_1 and z_2. The loss function for a positive pair (G_i, G′_i) is defined as

ℓ_i = −log [ exp(sim(z_{G_i}, z_{G′_i}) / τ) / ( Σ_{k=1}^{N} 1_{[k≠i]} exp(sim(z_{G_i}, z_{G_k}) / τ) + Σ_{k=1}^{N} exp(sim(z_{G_i}, z_{G′_k}) / τ) ) ],

where 1_{[k≠i]} is an indicator function that evaluates to 1 if k ≠ i, τ is a temperature parameter and z represents the latent representation.
Article https://doi.org/10.1038/s42256-023-00654-0

The numerator of the contrastive loss measures the agreement between the positive pair, while the denominator calculates the sum of the agreement between each graph and the other 2N − 1 graphs. This means that the latent representation z_{G_i} of the original graph should consider the similarity with not only the other original graph latent vectors {z_{G_k}}_{k≠i} but also all augmented graph latent vectors {z_{G̃_k}}_{k=1}^{N}. The latent representation z_{G̃_i} of the augmented graph follows the same calculation process. Finally, the loss is computed across all positive pairs in the minibatch.
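A minimal NumPy sketch of the NT-Xent loss for one positive pair, matching the description above. Shapes, the toy batch and the random latent vectors are illustrative; the actual model computes this in PyTorch over projected encoder outputs.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D latent vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nt_xent_loss(z_orig, z_aug, i, tau=0.1):
    """NT-Xent loss for the positive pair (G_i, augmented G_i).

    z_orig, z_aug: (N, d) latent vectors of the original / augmented graphs.
    """
    N = z_orig.shape[0]
    num = np.exp(cosine_sim(z_orig[i], z_aug[i]) / tau)
    # Denominator: agreement with the other originals...
    denom = sum(np.exp(cosine_sim(z_orig[i], z_orig[k]) / tau)
                for k in range(N) if k != i)
    # ...plus agreement with ALL augmented graphs (including the positive).
    denom += sum(np.exp(cosine_sim(z_orig[i], z_aug[k]) / tau)
                 for k in range(N))
    return -np.log(num / denom)

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
# Average over all positive pairs in the (toy) minibatch.
loss = np.mean([nt_xent_loss(z1, z2, i) for i in range(4)])
```

Because the denominator always contains the numerator term plus 2N − 2 non-negative terms, the loss is strictly positive, and minimizing it pulls positive pairs together while pushing the other 2N − 1 graphs away.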

Prompt generator
To stimulate the pre-trained model to recall the relevant knowledge learned before, we design a prompt generator f_prompt to produce a prompt x_prompt based on ElementKG and the input molecular graph G, that is, x_prompt = f_prompt(G, ElementKG). We detect all functional groups contained in G using the open-source package RDKit (ref. 41) and retrieve the corresponding functional group entities in ElementKG on the basis of their names. Then we obtain the embeddings of the functional group entities {x_1, …, x_m} using the KG embedding method, where m is the number of detected functional groups. To capture the importance of functional groups, we construct a learnable vector as the mediator (denoted x_0) and then apply the self-attention mechanism (ref. 42) to the embeddings of both the mediator and the functional groups. Specifically, the input X = {x_0, x_1, …, x_m} is first projected into the query/key/value matrices:

Q = X W_Q,  K = X W_K,  V = X W_V

where W_Q, W_K, W_V ∈ ℝ^{d×d} and d is the hidden dimension. The self-attention mechanism calculates the attention weights between queries and keys, and then multiplies them by the values. The output embedding is formulated as

X′ = softmax(Q K^⊤ / √d) V

We implement two self-attention layers and obtain the embedding of the mediator x′_0 = X′[:, 0], which reflects the combined contributions of functional groups with varying importance. We then feed it into a fully connected layer followed by layer normalization (ref. 43) to obtain the functional prompt:

x_prompt = LayerNorm(W x′_0 + b)

Finally, we add the prompt x_prompt to the original representation of each atom node in G with a learnable scale parameter α, resulting in the new input feature of a node v in G:

x′_v = x_v + α ⋅ x_prompt

We then feed this prompt-enhanced molecular graph into the pre-trained graph encoder, followed by a prediction network for downstream molecular properties.
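The prompt-generation step can be sketched in NumPy as follows. This is a simplified illustration: the weights are random stand-ins rather than learned parameters, the two attention layers share weights here only for brevity, and the fully connected layer plus layer normalization after the mediator readout are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """One self-attention layer. X: (m+1, d), row 0 is the mediator x_0."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(X.shape[1]))  # attention weights
    return A @ V

rng = np.random.default_rng(0)
d, m = 8, 3                        # hidden dim, number of detected groups
X = rng.normal(size=(m + 1, d))    # [mediator, group_1, ..., group_m]
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Two self-attention layers, then read out the mediator embedding, which
# aggregates the functional groups with learned importance weights.
X_out = self_attention(self_attention(X, W_q, W_k, W_v), W_q, W_k, W_v)
x_mediator = X_out[0]

# Add the (here: un-normalized) prompt to an atom feature with scale alpha.
alpha, x_v = 0.5, rng.normal(size=d)
x_v_new = x_v + alpha * x_mediator
```

In the full model the same prompt vector is added to every atom node of the molecule, so the functional-group knowledge conditions the entire graph encoder input.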

Graph encoder architecture
A molecular graph can be represented as G = (𝒱, ℰ), where 𝒱 denotes a set of nodes and ℰ denotes a set of edges. Each edge is bidirectional. Let x_v denote the initial features of node v, and x_{e_(u,v)} the initial features of edge e_(u,v). In particular, for atoms and bonds in the original molecular graph, we extract different initial features for them following specific chemical rules, as detailed in Supplementary Information.
Taking Fig. 1b as an example, for the augmented graph, we take the element entity embeddings obtained above as the initial features of element nodes. The initial feature of an edge between every two element nodes is obtained by mean pooling of the embeddings of multiple relations between the corresponding element entities in ElementKG. Following the same feature extraction method in the original molecular graph, we obtain the initial features of atoms and bonds. The edges between elements and their corresponding atoms are distinguished by different random initialization features, that is, the dashed edges with the same colour represent the same initial features while different colours indicate different representations.
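The mean-pooling initialization of an augmented-graph edge between two element nodes can be sketched as below. The relation names and embedding vectors are hypothetical toy values, not entries from the real ElementKG.

```python
import numpy as np

# Hypothetical KG embeddings for two relations between elements C and O.
relation_embeddings = {
    "inRadiusGroup1": np.array([0.2, 0.4, 0.6]),
    "inElectronegativityGroup2": np.array([0.0, 0.2, 0.4]),
}
relations_between = {
    ("C", "O"): ["inRadiusGroup1", "inElectronegativityGroup2"],
}

def edge_init(e1, e2):
    """Initial feature of the element-element edge: mean over all relation
    embeddings linking the two element entities in the KG."""
    vecs = [relation_embeddings[r] for r in relations_between[(e1, e2)]]
    return np.mean(vecs, axis=0)

feat = edge_init("C", "O")
```

Atom and bond initial features reuse the chemical-rule extraction of the original graph, while the element-to-atom dashed edges receive shared random initializations per element, as described above.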
Given the graph structure, node features and edge features, our goal is to learn a graph encoder f(⋅) that maps the input graph to a vector representation. In our case, we implement CMPNN (ref. 26) as the graph encoder, which improves graph embeddings by strengthening the message interactions between edges and nodes.
First, to update the node hidden states, each node v ∈ 𝒱 aggregates the representations of its incoming edges instead of its neighbouring nodes in G. The intermediate message vector is obtained as

m^k(v) = MaxPool({h^{k−1}(e_(u,v))}_{u∈𝒩(v)}) ⊙ Σ_{u∈𝒩(v)} h^{k−1}(e_(u,v))

where k denotes the current depth of the message passing, MaxPool is the max pooling operator and ⊙ is an element-wise multiplication operator. Here we apply max pooling to highlight the edges with the highest information intensity, as the hidden state of a node is mainly based on the strongest message from its incoming edges. Then, the node's current hidden state h^{k−1}(v) is concatenated with the message vector m^k(v) and fed through a communicative function to update the node's hidden state h^k(v):

h^k(v) = COMMUNICATE(h^{k−1}(v), m^k(v))

where the hidden state h^k(v) acts as a message transfer station that receives incoming messages, integrates them and sends them to the next station. The communicative function is implemented by feeding the concatenated node and message features into an MLP followed by a rectified linear unit (ReLU) activation.

Second, we extract the message of the edge e_(v,w) by subtracting the information of its inverse edge from h^k(v):

m^k(e_(v,w)) = h^k(v) − h^{k−1}(e_(w,v))

where e_(w,v) is the inverse edge of e_(v,w). To update the edge hidden states, we feed the edge intermediate message m^k(e_(v,w)) into a fully connected layer, add the result to the initial edge feature x_{e_(v,w)} and apply a ReLU activation to the output, which serves as the edge hidden state for the next iteration:

h^k(e_(v,w)) = ReLU(x_{e_(v,w)} + W ⋅ m^k(e_(v,w)))

Third, after K iterations, one more round of interaction is applied, and the final node representation h(v) is obtained by gathering the message from incoming edges, the current node representation and the initial node feature:

h(v) = COMMUNICATE(m^{K+1}(v), h^K(v), x_v)

Finally, a readout operator is applied to obtain the whole-graph representation:

h_G = GRU({h(v) | v ∈ 𝒱})

where GRU is the gated recurrent unit introduced in ref. 44.
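One CMPNN message-passing iteration can be sketched in NumPy as follows. The toy graph, random weights and the exact form of the communicate function are simplified stand-ins; the real encoder is a trained PyTorch module.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
nodes = ["a", "b", "c"]
# Bidirectional edges: each chemical bond appears in both directions.
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]
h_node = {v: rng.normal(size=d) for v in nodes}   # node hidden states
h_edge = {e: rng.normal(size=d) for e in edges}   # edge hidden states
x_edge = {e: rng.normal(size=d) for e in edges}   # initial edge features
W_node = rng.normal(size=(2 * d, d))
W_edge = rng.normal(size=(d, d))
relu = lambda x: np.maximum(x, 0.0)

# Node update: aggregate incoming-edge states, boosting the sum with the
# element-wise max (the "strongest message"), then communicate via an MLP.
new_h_node = {}
for v in nodes:
    inc = np.stack([h_edge[(u, w)] for (u, w) in edges if w == v])
    m_v = inc.max(axis=0) * inc.sum(axis=0)            # MaxPool ⊙ sum
    new_h_node[v] = relu(np.concatenate([h_node[v], m_v]) @ W_node)

# Edge update: subtract the inverse edge, then FC layer + residual + ReLU.
new_h_edge = {}
for (v, w) in edges:
    m_e = new_h_node[v] - h_edge[(w, v)]               # remove backflow
    new_h_edge[(v, w)] = relu(x_edge[(v, w)] + m_e @ W_edge)
```

Subtracting the inverse edge prevents a message from immediately flowing back along the bond it arrived on, which is what strengthens node-edge interaction relative to plain message passing.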
Implementation details
Since the raw data are in the form of molecular SMILES, which is a line notation for describing the structure of chemical species using short ASCII strings, we utilize the open-source chemical analysis tool RDKit to convert them into 2D molecular graphs and to extract the atom and bond features. The initial features of atoms are determined by their eight associated attributes (for example, chirality, hybridization and atomic mass), and the bonds are embedded by their four related attributes (for example, bond type and conjugation), as detailed in Supplementary Information. In contrastive pre-training, we utilize the Adam optimizer with a learning rate of 3 × 10⁻⁵ to optimize the NT-Xent loss and set the temperature parameter τ to 0.1. We apply an MLP with a ReLU activation function as the projection network. The model is trained with a batch size of 1,024 for 50 epochs.
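Categorical atom attributes are typically one-hot encoded and concatenated into a single feature vector. The sketch below illustrates this pattern with invented attribute names and vocabularies; it does not reproduce the exact eight attributes or encodings used in the paper.

```python
def one_hot(value, vocab):
    """Encode a categorical value as a one-hot float list over a vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocab]

# Hypothetical (truncated) vocabularies for two categorical attributes.
HYBRIDIZATIONS = ["SP", "SP2", "SP3"]
CHIRALITIES = ["NONE", "R", "S"]

def atom_features(atom):
    """Concatenate one-hot categorical attributes and scaled continuous ones."""
    feats = one_hot(atom["hybridization"], HYBRIDIZATIONS)
    feats += one_hot(atom["chirality"], CHIRALITIES)
    feats += [atom["mass"] / 100.0]   # continuous attribute, rescaled
    return feats

carbon = {"hybridization": "SP3", "chirality": "NONE", "mass": 12.011}
features = atom_features(carbon)
```

In practice the attribute values would come from RDKit atom objects (for example, `atom.GetHybridization()`) rather than hand-written dictionaries.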
In prompt-enhanced fine-tuning, we use RDKit to detect the functional groups in each molecule. We apply two self-attention layers on all functional groups and the mediator. The output is fed into a fully connected layer, which is then layer normalized. We adopt a two-layer MLP as the property prediction network. For classification tasks, we utilize the binary cross-entropy (BCE) loss combined with a sigmoid layer (BCEWithLogits loss) when training the graph encoder and the property prediction network, while for regression tasks, we apply the mean squared error loss. The Adam optimizer is applied to the graph encoder with a learning rate ranging from 1 × 10⁻⁴ to 1 × 10⁻³ for all datasets, and the learning rate of the prompt generator is five times that of the graph encoder. We train the model on the training set and search hyper-parameters on the validation set for the best results. The training is set to 100 epochs. We fine-tune the pre-trained model three times with a batch size of 256 to report the average and standard deviation of performance on the testing set, using ROC-AUC for classification tasks and mean absolute error/root mean square error for regression tasks. KANO is implemented using PyTorch and runs on an Ubuntu server with NVIDIA GeForce RTX 3090Ti graphics processing units.
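The BCEWithLogits loss used for classification combines the sigmoid and BCE in a numerically stable form. A pure-Python sketch of the standard stable formulation (equivalent in spirit to PyTorch's `BCEWithLogitsLoss` for a single example):

```python
import math

def bce_with_logits(logit, target):
    """Stable binary cross-entropy on a raw logit, target in {0.0, 1.0}.

    Uses max(x, 0) - x*z + log(1 + exp(-|x|)), which avoids computing
    sigmoid(x) explicitly and so never overflows for large |logit|.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Confident correct prediction -> small loss; confident wrong one -> large.
small = bce_with_logits(2.0, 1.0)
large = bce_with_logits(-2.0, 1.0)
```

Folding the sigmoid into the loss this way is why the prediction network can output raw logits during training.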

Data availability
The ElementKG, pre-training data and molecular property prediction benchmarks used in this work are available in the Code Ocean capsule at https://doi.org/10.24433/CO.5629517.v1 and the GitHub repository at https://github.com/HICAI-ZJU/KANO. Source data are provided with this paper. Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2023
Nature Machine Intelligence

Extended Data Fig. 1 | Exploration of knowledge abundance in ElementKG. a, Performance of KANO with different ElementKG components. Green denotes the removal of the class hierarchy from ElementKG, which removes the various classes (except for the lowest-level classes directly connected with entities), as well as the axioms rdfs:subClassOf and owl:disjointWith; it consists only of entities, lowest-level classes, data properties and object properties. Purple denotes the deletion of the data properties of each entity. Yellow represents the removal of the entire functional group component, including the class hierarchy and entities of functional groups, and their relations with element entities. Red indicates the complete ElementKG with all components. The results are reported as mean values ± SD over three independent runs. The error bars represent the SD, while the dots represent the three individual data points. b, Performance of KANO with different keeping rates of data properties in ElementKG. We vary the proportion of data properties of element entities retained in ElementKG and report the corresponding performance trends across datasets in various domains, represented by different colours. The horizontal axis represents the keeping rate, which refers to the proportion of knowledge introduced. The vertical axis represents the performance, measured by ROC-AUC on classification tasks (higher is better) and RMSE and MAE on regression tasks (lower is better). The results are reported as mean values ± SD over three independent runs. The mean is represented by the lines, the SD is depicted by the error bars, and individual data points are marked with dots.