Hyper-Mol: Molecular Representation Learning via Fingerprint-Based Hypergraph

With the development of artificial intelligence (AI) in the field of drug design and discovery, learning informative representations of molecules is becoming crucial for AI-driven tasks. In recent years, graph neural networks (GNNs) have emerged as a preferred deep learning architecture and have been successfully applied to molecular representation learning (MRL). Current MRL methods directly apply the message passing mechanism to the atom-level attributes (i.e., atoms and bonds) of molecules. However, they neglect latent yet significant hyperstructured knowledge, such as the information of pharmacophores or functional classes. Hence, in this paper, we propose Hyper-Mol, a new MRL framework that applies GNNs to encode hypergraph structures of molecules via fingerprint-based features. Hyper-Mol explores the hyperstructured knowledge and the latent relationships of the fingerprint substructures from a hypergraph perspective. A molecular hypergraph generation algorithm is designed to depict the hyperstructured information with the physical and chemical characteristics of molecules. Thus, the fingerprint-level message passing process can encode both the intra-structured and inter-structured information of fingerprint substructures according to the molecular hypergraphs. We evaluate Hyper-Mol on molecular property prediction tasks, and the experimental results on real-world benchmarks show that Hyper-Mol can learn comprehensive hyperstructured knowledge of molecules and is superior to state-of-the-art baselines.


Introduction
Machine learning has achieved great success in the field of artificial intelligence (AI) and has been pervasively adopted in many human-centered applications [1][2][3][4][5]. Following the machine learning paradigm, large amounts of research effort have been dedicated to developing new approaches for drug design and discovery in recent years. How to learn informative representations of molecules is critically important for AI-driven tasks [6][7][8]. For example, well-learned molecular representations can benefit molecular property prediction, which advances drug candidate selection for further validation and virtual screening on large datasets. Chemical fingerprints [9] are widely used for representing molecules; their algorithms normally encode the physical or chemical characteristics of molecules into bit vectors. Another line of research [10][11][12] introduces deep learning [13] to generate structure-aware or context-aware neural fingerprints for molecules. Since molecules can be naturally converted to graphs, where atoms and bonds are represented as nodes and edges, respectively [14], graph neural networks (GNNs) are commonly applied for molecular representation learning (MRL). Most related approaches [15][16][17][18][19] have dedicated tremendous effort to modeling atom-level relationships. Some [6-8, 14, 20] utilize molecular geometry and structural information to develop self-supervised learning paradigms for pretraining GNN models. Following the message passing rules of GNNs, they carefully design the learning procedures to encode structural information on atom and bond attributes. Despite the promising results achieved by recent MRL methods in many drug design and discovery tasks, we argue that the following issues remain unsolved.
(i) Chemical fingerprints use bits to record the presence of certain physical or chemical characteristics of molecules. However, the topological information and the latent relationships among the extracted fingerprint substructures cannot be leveraged in such bit-style forms. (ii) Although some structure-aware or context-aware information about atom and bond interactions can be encoded to generate molecular representations, the hyperstructured knowledge, such as the information of a pharmacophore or functional class, has not been exploited.
Hence, to deal with the aforementioned problems, we introduce the concept of hypergraph and propose a novel MRL framework, dubbed Hyper-Mol, which encodes fingerprint-based Hypergraph structures of Molecules via GNNs. Hyper-Mol further exploits the information underneath the bit-style molecular fingerprints, learning molecular representations by exploring the hyperstructured knowledge and the latent relationships of the fingerprint substructures. Specifically, in Hyper-Mol, we utilize molecular fingerprint algorithms to produce topological fingerprints with physical and chemical characteristics of molecules, in which the pharmacophore-aware or functional class-aware components can be embedded in the generated clusters (i.e., the substructures of fingerprints) according to the algorithms. The basic idea of molecular hypergraph generation is that two objects are close to each other if they are referenced by similar or shared objects [2,21,22]. Thus, the hypergraph of each molecule is constructed based on the topological relationships among the fingerprint substructures. To be precise, any two fingerprint substructures of a molecule that have overlapping subregions (i.e., shared atoms or bonds) should be close to each other in the hypergraph, which means that they will have a positive hyperlink in the hypergraph. The intra-structured information in fingerprint subgraphs and the inter-structured information in fingerprint hypergraphs are encoded via the message passing mechanism to learn comprehensive hypergraph representations for molecules.
We conclude our contributions as follows: (1) We propose Hyper-Mol, which learns molecular representations by utilizing molecular fingerprints from a hypergraph perspective. (2) A molecular hypergraph generation algorithm is designed to preserve the hyperstructured information with physical and chemical characteristics of molecules. (3) The hyperstructured knowledge of molecular fingerprints can be exploited by the fingerprint-level message passing process from both intra-structured and inter-structured information according to the molecular hypergraphs. (4) The experimental results show that Hyper-Mol can learn comprehensive molecular representations for molecular property prediction tasks compared with the state-of-the-art methods.
The rest of the paper is organized as follows: In Section 2, related work is briefly introduced. In Section 3, we present Hyper-Mol. After that, the proposed method is evaluated against several state-of-the-art baselines, and the detailed experiments are given in Section 4. Finally, we conclude our work and point out future work in Section 5.

Fingerprint Generation on Molecules.
Traditional ways of representing molecules are chemical fingerprints, such as pharmacophore fingerprints [23,24], functional-class fingerprints, and extended-connectivity fingerprints [9]. These algorithms mostly utilize bit vectors to represent the existence of pharmacophores, functional classes, or geometric characteristics in molecules. Inspired by the success of deep learning in computer vision and natural language processing, some deep neural architectures have been introduced to generate low-dimensional vector representations for molecules. For example, prior studies [10,11] make use of convolutional neural networks [25] to learn molecular neural fingerprints. Xu et al. [12] propose Seq2Seq fingerprints by exploiting SMILES [26,27] strings based on the sequence-to-sequence neural framework [28,29].

Molecular Representation Learning on Graphs.
Because molecules can be easily converted to graph data, graph neural networks (GNNs) have been widely adopted to learn molecular representations in recent years. Some approaches [15][16][17] apply graph convolutional networks [30] to encode atom relationships in molecules. To capture bond features, [18,19] further develop the message passing process to also model bond interactions. MGCN [31] is proposed to model the multilevel quantum interactions of molecules from hierarchical perspectives (i.e., atom-wise, pair-wise, triple-wise, and so on). With the development of self-supervised learning, Hu et al. [20] propose pretraining strategies to learn molecular representations with self-supervised pretext tasks at the atom level. They define several types of graph proximity as the self-supervised learning objectives, which push GNNs to generate meaningful atom representations. Other up-to-date techniques [6][7][8]14] follow the same idea and develop more molecular information-related pretext strategies. N-Gram [32] constructs node (atom) representations by predicting the node (atom) attributes, utilizing SMILES strings.
Different from the previous work, our proposed Hyper-Mol not only enhances the expressive power of chemical fingerprints but also models the topological information and the relationships of the fingerprint substructures (with physical and chemical characteristics) from a hypergraph perspective.

Preliminaries. Let G = (V, E) be a molecular graph,
where V denotes the atom set and E denotes the bond set of the molecule. Suppose a molecule has n fingerprint substructures; the substructure set is denoted as S = {S_1, S_2, . . ., S_n}.

Molecular Fingerprint. Molecular fingerprints are a way of encoding the structure of a molecule [33]. The most common type of fingerprint is a series of binary digits (bits) [34,35] that represent the presence or absence of particular substructures in the molecule. Therefore, the similarity between two molecules can be calculated by comparing their fingerprints.
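For illustration, here is a minimal sketch of bit-style fingerprint comparison using RDKit's Morgan/ECFP implementation (the molecules and hyperparameters are illustrative assumptions, not taken from the paper):

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Two illustrative molecules: aspirin and salicylic acid.
mol_a = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
mol_b = Chem.MolFromSmiles("OC(=O)c1ccccc1O")

# 2048-bit ECFP-style Morgan fingerprints with radius 2.
fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=2048)
fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=2048)

# Compare the bit vectors: Tanimoto similarity in [0, 1].
print(DataStructs.TanimotoSimilarity(fp_a, fp_b))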

Graph Neural Networks.
The architecture of graph neural networks (GNNs) has recently been developed as one of the crucial deep learning techniques. The core idea behind GNNs is message passing through the network topology of graphs: node representations are updated by propagating and aggregating structural information from the neighborhood to the target node. Formally, the k-th layer of a GNN and the final graph readout can be written as

a_v^{(k)} = \mathrm{AGGREGATE}^{(k)}\left(\left\{ h_u^{(k-1)} : u \in N(v) \right\}\right),
h_v^{(k)} = \mathrm{COMBINE}^{(k)}\left(h_v^{(k-1)}, a_v^{(k)}\right),
h_G = \mathrm{READOUT}\left(\left\{ h_v^{(K)} : v \in V \right\}\right),   (1)

where the AGGREGATE function in the k-th layer aggregates neighborhood information of the target node v, and the COMBINE function combines the information of the target node v and its neighborhood N(v). The READOUT function normally applies sum/mean/max pooling methods to generate the graph representation h_G.
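As a hedged sketch (not the paper's implementation), the following minimal PyTorch layer instantiates equation (1) with sum aggregation and an MLP combine step; all names are illustrative:

import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    # One layer of equation (1): sum-AGGREGATE over neighbors, MLP-COMBINE.
    def __init__(self, dim):
        super().__init__()
        self.combine = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, h, edge_index):
        # h: (num_nodes, dim); edge_index: (2, num_edges), rows = (src, dst).
        src, dst = edge_index
        agg = torch.zeros_like(h)
        agg.index_add_(0, dst, h[src])                    # AGGREGATE: sum of messages
        return self.combine(torch.cat([h, agg], dim=-1))  # COMBINE

def readout(h):
    return h.sum(dim=0)                                   # READOUT: sum pooling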

Overall Framework.
Hyper-Mol encodes hypergraph structures of molecules via fingerprint-based features. As illustrated in Figure 1, the overall framework of Hyper-Mol consists of three main components: fingerprint extraction, hypergraph generation, and hypergraph feature encoding.

Fingerprint Extraction.
The extended-connectivity fingerprints (ECFPs) are a class of topological fingerprints for molecular characterization [9]. Physical and chemical characteristics of molecules can be encoded by ECFPs. For example, the functional-class fingerprints are a variant of the ECFPs that describe substructures according to their roles in pharmacophores. Thus, in Hyper-Mol, we employ the ECFPs algorithm to extract molecular fingerprints due to its interpretability and effectiveness in modeling [36] (note that any fingerprint extraction algorithm that satisfies the rules of Hyper-Mol can be employed without restriction).
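To make the extraction step concrete, here is a hedged RDKit sketch of recovering the atoms and bonds of each ECFP substructure; the bitInfo bookkeeping is RDKit's mechanism, and the variable names are illustrative:

from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # illustrative molecule

# bitInfo maps each set bit to its (center atom, radius) environments.
bit_info = {}
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048, bitInfo=bit_info)

substructures = []
for bit, environments in bit_info.items():
    for center_atom, radius in environments:
        # Bond indices of the circular environment around the center atom.
        bond_ids = Chem.FindAtomEnvironmentOfRadiusN(mol, radius, center_atom)
        atom_ids = {center_atom}
        for b in bond_ids:
            bond = mol.GetBondWithIdx(b)
            atom_ids.update({bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()})
        substructures.append((bit, atom_ids, set(bond_ids)))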

Hypergraph Generation.
The hypergraph of each molecule is generated based on the topological relationships among the extracted fingerprint substructures and the molecular graph, where nodes are the fingerprint substructures and edges are the connections between substructures in the molecular graph. To be precise, the intra-structured information of a fingerprint substructure is composed of atoms and bonds. Any two substructures that have overlapping intra-structured regions (i.e., shared atom-level structures) in the molecular graph will have a hyperlink between each other.

Hypergraph Feature Encoding.
In Hyper-Mol, the Intra-Encoder encodes the intra-structured information for each fingerprint substructure, and its output is used as the initial fingerprint substructure representations. The Inter-Encoder then takes in the hypergraphs and the fingerprint substructure representations of molecules, propagating and aggregating the inter-structured information among fingerprint substructures following the message passing mechanism of GNNs. Based on equation (1), the hypergraph-level representations of molecules are obtained after training the neural models.

Fingerprint-Based Hypergraph Generation.
The extended-connectivity fingerprints are circular topological fingerprints designed for molecular characterization and structure-activity modeling. In the hypergraph generation process, we first apply the ECFPs algorithm [9] to generate the fingerprints and their substructures.
Suppose there are M fingerprints generated from M molecules. S = {S_1, S_2, . . ., S_n} denotes the substructure set of a fingerprint from molecular graph G (without loss of generality, we omit the subscripts of S and G for simplicity). Algorithm 1 illustrates the process that generates the hypergraph of a molecule based on its fingerprint substructures. We first obtain all the relative positions among the fingerprint substructures by the Cartesian product (Line 2). Then, we set a positive hyperlink between two substructures if they share at least one common subregion from G (Lines 5 to 6). Otherwise, a negative hyperlink is set between the two (Line 8). E collects all the hyperlinking information among the fingerprint substructures (Line 10). Finally, a new hypergraph of the molecule is generated.
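A minimal Python sketch of this step, assuming each substructure S_i is represented as its set of atom indices (a hypothetical data layout, mirroring the pairwise rule of Algorithm 1):

from itertools import combinations

def generate_hypergraph(substructures):
    # substructures: list of atom-index sets, one per fingerprint substructure.
    positive, negative = [], []
    # Enumerate relative positions among substructures; since the hypergraph
    # is undirected, unordered pairs suffice instead of the full product.
    for i, j in combinations(range(len(substructures)), 2):
        if substructures[i] & substructures[j]:   # shared subregion
            positive.append((i, j))               # positive hyperlink
        else:
            negative.append((i, j))               # negative hyperlink
    return positive, negative

# Toy usage: the first two substructures overlap on atom 2; the third is disjoint.
print(generate_hypergraph([{0, 1, 2}, {2, 3}, {5, 6}]))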

Hypergraph Feature Encoding.
Hyper-Mol encodes hypergraph features with two kinds of encoders: the Intra-Encoder and the Inter-Encoder.

Intra-Encoder.
According to the ECFPs algorithm, the number of generated fingerprint substructures is fixed. Thus, the Intra-Encoder simply adopts one-hot encoding to distinguish each fingerprint substructure in a "fingerprint substructure vocabulary" from every other fingerprint substructure in the "vocabulary." The output representations X of the fingerprint substructures form an N × N matrix, where N represents the number of fingerprint substructures and also the fixed length of the one-hot vector. Each vector in the matrix consists of 0s in all cells except for a single 1 in the cell that uniquely identifies the fingerprint substructure.
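A short hedged sketch of this one-hot scheme (the toy vocabulary is illustrative):

import torch

# Illustrative vocabulary: one entry per distinct fingerprint substructure.
vocab = {"sub_a": 0, "sub_b": 1, "sub_c": 2}
N = len(vocab)

# X is an N x N matrix; row i is all 0s except a single 1 in column i,
# uniquely identifying the i-th fingerprint substructure.
X = torch.eye(N)

print(X[vocab["sub_b"]])  # tensor([0., 1., 0.])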

Inter-Encoder.
The molecular hypergraphs and the one-hot fingerprint substructure representations are fed to the Inter-Encoder, in which we apply two widely adopted GNN backbones, i.e., graph convolutional networks (GCNs) and graph isomorphism networks (GINs), to respectively encode the hyperstructured features for each molecule.
The layer-wise propagation rule of GCNs in the Inter-Encoder is as follows:

H^{(k+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(k)} W^{(k)}\right),

where \tilde{A} = A + I_n is the adjacency matrix of the undirected hypergraph \bar{G} with added self-loops, I_n is the identity matrix, and \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}. W^{(k)} is the k-th layer trainable weight matrix and \sigma(\cdot) is an activation function. H^{(k)} represents the hidden representations of the fingerprint substructures in the k-th layer.
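A minimal dense PyTorch sketch of this propagation rule (illustrative, assuming the hypergraph fits in a dense adjacency matrix):

import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        # A_tilde = A + I_n: add self-loops to the hypergraph adjacency.
        A_tilde = A + torch.eye(A.size(0))
        # Symmetric normalization: D_tilde^{-1/2} A_tilde D_tilde^{-1/2}.
        d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
        A_norm = d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)
        # Propagate, transform with W^{(k)}, and apply the nonlinearity.
        return torch.relu(self.weight(A_norm @ H))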

Hypergraph Representation.
To obtain the hypergraph representation of the molecule, we apply a sum-pooling layer after the graph convolution layers of the Inter-Encoder.
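In the notation above, with K graph convolution layers, this readout can be written as (a reconstruction consistent with equation (1), not the paper's original typesetting):

h_{\bar{G}} = \sum_{S_i \in S} h_{S_i}^{(K)},

where h_{S_i}^{(K)} denotes the final-layer representation of fingerprint substructure S_i.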

Time Complexity. Given a molecular graph G = (V, E)
and its generated hypergraph \bar{G} = (S, \bar{E}), the time complexity of extracting fingerprint substructures is O(|V|^2), following the ECFPs algorithm in which two iterations are enough for fingerprints to be functional in similarity search and clustering [9]. With a complexity of O(|S|), we can obtain the nodes (i.e., the fingerprint substructures) of the molecular hypergraph. After that, the edge (i.e., hyperlink) generation in the hypergraph can be performed in O(|S|^2 / 2). Due to the GNN architecture, the time complexity of the graph convolution operation is O(|\bar{E}|) per neural layer.

Datasets.
We conduct the experiments on the HIV, BBBP, BACE, Tox21, SIDER, and ClinTox molecular property prediction benchmark datasets (https://moleculenet.org/datasets-1), all of which are from MoleculeNet [37]. The prediction tasks can be formulated as a set of binary and multilabel graph-level classification problems. To be precise, the HIV, BBBP, and BACE datasets are used for the binary classification tasks, and the Tox21, SIDER, and ClinTox datasets are used for the multilabel classification tasks. The detailed descriptions of all datasets are shown in Table 1.

Baselines.
We thoroughly evaluate Hyper-Mol against six state-of-the-art approaches. Among them, graph convolutional networks (GCN) [30] and graph isomorphism networks (GIN) [38] are two popular GNN-based frameworks that can learn the structural information of network-based data in a supervised manner. N-Gram [32] extracts the context of vertices and assembles the representations in short walks directly through the molecular graph. Hu et al. [20] design self-supervised strategies for learning molecular representations. SchNet [16] is a continuous-filter convolutional neural network for modeling quantum interactions, and MGCN [31] models bond features in message passing processes.

Experimental Settings.
As suggested in previous work [20], we adopt the scaffold split to create the train/validation/test sets with a ratio of 8:1:1. The scaffold splitting method splits molecules according to their molecular substructures, which is more challenging yet realistic. Compared with the random split, it better evaluates the generalization ability of the models on out-of-distribution data samples.
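For illustration, a hedged sketch of a Bemis-Murcko scaffold split with RDKit (a common realization of scaffold splitting; the exact grouping policy here is an assumption, not the paper's code):

from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    # Group molecule indices by their Bemis-Murcko scaffold SMILES.
    scaffolds = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        scaffolds[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(i)
    # Assign whole scaffold groups (largest first), so that the test set
    # contains molecules whose scaffolds were never seen during training.
    train, valid, test = [], [], []
    n = len(smiles_list)
    for group in sorted(scaffolds.values(), key=len, reverse=True):
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test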
We apply the GCN and GIN architectures (i.e., the AGGREGATE and COMBINE functions) in Hyper-Mol, respectively. Sum pooling is used as the READOUT function to obtain the molecular graph representations. We train the neural networks for 100 epochs with a batch size of 32. ReLU [39] is adopted as the activation function, and Adam [40] is employed for optimization. To fit the supervised molecular property prediction tasks, we use the sigmoid function and the binary cross entropy to measure the loss between the target and the predicted probabilities. Since the input vectors of the fingerprint representations are generated by the ECFPs algorithm, we set the two hyperparameters of ECFPs (i.e., the length and the radius) to the commonly adopted default values of 2048 and 2, respectively.
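A hedged sketch of the corresponding training step (BCEWithLogitsLoss fuses the sigmoid with the binary cross entropy; the placeholder model, learning rate, and dummy data are illustrative assumptions):

import torch
import torch.nn as nn

# Placeholder standing in for Hyper-Mol's encoders: it maps a 2048-bit
# fingerprint-style input to one logit per prediction task.
model = nn.Sequential(nn.Linear(2048, 300), nn.ReLU(), nn.Linear(300, 12))
criterion = nn.BCEWithLogitsLoss()  # sigmoid + binary cross entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is assumed

# Dummy batch of 32 samples with 12 binary labels each (e.g., Tox21 tasks).
x = torch.randn(32, 2048)
y = torch.randint(0, 2, (32, 12)).float()

for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()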
We use the ROC-AUC (area under the receiver operating characteristic curve) [41] as the evaluation metric for both the binary and multilabel classification tasks. We execute three independent runs and report the mean and standard deviation of the test ROC-AUC on each benchmark. Tables 2 and 3 summarize the overall performance of Hyper-Mol along with the other baseline methods, where the best results (higher is better) are shown in bold. We have the following observations: (1) Hyper-Mol achieves the best average ROC-AUC scores in both binary and multilabel tasks over the experimented datasets. Besides, Hyper-Mol outperforms all the state-of-the-art baselines on 4 out of 6 datasets; (2) the GCN backbone in the Hyper-Mol framework is more effective than the GIN backbone, achieving an overall relative improvement of 1% on the average ROC-AUC scores.

Contribution of Hyper-Mol in Binary Classification.
As presented in Tables 2 and 4, Hyper-Mol surpasses all the methods on the BBBP and BACE datasets, and also shows rival performance compared with the best-performing N-Gram on the HIV dataset. Moreover, the GCN and GIN backbones in Hyper-Mol with the fingerprint-level message passing mechanism achieve 21.1% and 23.3% improvement, respectively, compared with their atom-level counterparts.

Contribution of Hyper-Mol in Multilabel Classification.
Tables 3 and 4 demonstrate that the multilabel classification tasks are more challenging than the binary ones. The models proposed by Hu et al. and N-Gram perform competitively in the multilabel classification tasks. Hyper-Mol still achieves the highest results on the SIDER (with 27 tasks) and ClinTox (with 2 tasks) datasets. Similar to the phenomenon observed in the binary classification tasks, the fingerprint-level message passing processes in Hyper-Mol with the GCN and GIN backbone architectures achieve 21.8% and 19.3% improvement, respectively, compared with atom-level message passing.

Impact of ECFPs Hyperparameters.
Hyper-Mol applies the ECFPs algorithm to generate fingerprints for molecules. To show the impact of the hyperparameters (i.e., the length and the radius) on Hyper-Mol, we conduct two types of model sensitivity experiments: (1) we fix the radius at 2 and vary the length in the set {1024, 2048, 4096}; (2) we vary the radius from 2 to 4 with the length fixed at 2048. Figure 2 presents how the fingerprint length affects the performance of Hyper-Mol on the SIDER (multilabel task) and BACE (binary task) datasets, respectively, when the radius is set to 2. We observe that with a larger fingerprint length, Hyper-Mol with both GCN and GIN backbones achieves better performance on the SIDER dataset. The best ROC-AUC score, achieved by Hyper-Mol (GIN) with length 4096, reaches 0.664 ± 0.021. On the BACE dataset, there is also an improvement with the larger lengths (2048 and 4096) compared with the relatively small length (1024). Figure 3 presents the corresponding results when the radius varies from 2 to 4 with the length fixed at 2048.

Discussion. Results on the experimented datasets show that the proposed Hyper-Mol is superior to the state-of-the-art baseline methods on the molecular property prediction tasks. The message passing processes in the baselines aggregate and propagate structural information at the atom level, which forces their neural networks to learn relatively "microscopic" graph-structured knowledge of molecules, i.e., the relationships of atoms and bonds. However, more sophisticated information about molecules, such as the pharmacophore-aware or functional class-aware characteristics, is normally embedded in meaningful clusters of atoms and bonds, for example, the components of molecular fingerprint substructures. Different from atom-level message passing, which lacks meaningful "interactions" between clusters, Hyper-Mol perceives hyperstructured information through the fingerprint-level message passing mechanism. Instead of absorbing atom-attributed or bond-attributed features only, Hyper-Mol utilizes fingerprint-attributed features to depict informative context relationships of the molecular fingerprint substructures. Physical and chemical characteristics of fingerprint-specific knowledge can be encoded into the final molecular graph representation from a hypergraph perspective. Therefore, the overall performance of Hyper-Mol is superior to the baselines.

Conclusions and Future Work
In order to learn molecular representations with more sophisticated knowledge of physical and chemical characteristics, we propose Hyper-Mol, a novel MRL framework, which encodes Hypergraph structures of Molecules via a fingerprint-level message passing mechanism. Hyper-Mol constructs hypergraphs of molecules by utilizing both intra-structured and inter-structured topological information of chemical fingerprint substructures, and applies GNNs to learn meaningful molecular representations based on the extracted hyperstructured features. Experimental results show that Hyper-Mol can depict informative context relationships of the fingerprint substructures and is superior to the state-of-the-art approaches on various molecular property prediction tasks, such as bioactivity, pharmacokinetics, and toxicity prediction. Future work will focus on exploring self-supervised or unsupervised learning frameworks for encoding hypergraph knowledge of molecules. Meanwhile, we also plan to incorporate both atom-level and fingerprint-level information to learn more comprehensive representations for molecules.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.