Hierarchical multimodal self-attention-based graph neural network for DTI prediction

Abstract Drug–target interactions (DTIs) are a key part of the drug development process, and their accurate and efficient prediction can significantly boost development efficiency and reduce development time. Recent years have witnessed the rapid advancement of deep learning, resulting in an abundance of deep learning-based models for DTI prediction. However, most of these models use a single representation of drugs and proteins, making it difficult to comprehensively represent their characteristics. Multimodal data fusion can effectively compensate for the limitations of single-modal data. However, existing multimodal models for DTI prediction do not take both intra- and inter-modal interactions into account simultaneously, resulting in limited representation capabilities of the fused features and reduced DTI prediction accuracy. To address multimodal feature fusion, we propose a hierarchical multimodal self-attention-based graph neural network for DTI prediction, called HMSA-DTI. Our proposed HMSA-DTI takes drug SMILES, drug molecular graphs, protein sequences and protein 2-mer sequences as inputs, and utilizes a hierarchical multimodal self-attention mechanism to achieve deep fusion of the multimodal features of drugs and proteins, enabling the capture of intra- and inter-modal interactions between drugs and proteins. We demonstrate that HMSA-DTI has significant advantages over other baseline methods on multiple evaluation metrics across five benchmark datasets.


Introduction
Drug–target interaction (DTI) prediction is a critical part of the drug development process, but traditional screening experiments require a substantial investment of human and material resources, resulting in high drug development costs [1]. With the increasing number of compound libraries and target libraries, computational methods have become an effective tool for improving drug development efficiency and reducing research costs [2]. In general, there are two classes of methods for predicting DTIs: traditional machine learning methods and deep learning methods. Traditional machine learning methods utilize various descriptors to encode drugs and proteins, which are taken as inputs to random forests [3][4][5], support vector machines [6][7][8] and logistic regression [9][10][11] for predicting DTIs. With the rapid increase in bioactivity data, data distributions have become more complex. Due to its ability to automatically extract features and learn complex nonlinear relationships, deep learning is well suited to handling complex data distributions. Consequently, it is increasingly employed in DTI prediction, showcasing impressive performance.
Deep learning methods take drugs and proteins as inputs and automatically extract their features through network models to predict DTIs [12]. Recent developments in deep learning have advanced the field of DTI prediction, yielding a wide variety of prediction methods. Öztürk et al. [13] presented DeepDTA, in which one convolutional neural network (CNN) module for drug SMILES and another CNN module for protein sequences are employed to extract their features for drug–target affinity prediction. Thereafter, Lee et al. [14] proposed DeepConvDTI, which feeds protein sequences and drug fingerprints into multi-scale 1D convolutional modules and fully connected layers, respectively, for the extraction of their features. Abbasi et al. [15] presented DeepCDA, which learns compound and protein features by combining CNN and long short-term memory networks, and employs two-sided attention to compute their interactions. With the Transformer [16] achieving extraordinary success in machine translation tasks, a number of Transformer-based DTI prediction models have been developed. Chen et al. [17] developed TransformerCPI, in which a gated convolutional network and a graph convolutional network (GCN) are employed to extract protein and drug features, respectively; a modified Transformer is then utilized to calculate interaction features for DTI prediction.
Although sequence-based DTI prediction has made remarkable progress, it has limited capabilities to capture a molecule's chemical structure, consequently adversely affecting DTI prediction. A graph-based representation can describe the topological structure of a molecule, which is beneficial to DTI prediction. Consequently, graph neural networks (GNNs) have been widely applied to DTI prediction, delivering superior predictive performance. Tsubaki et al. [18] utilized a GNN and a CNN to compute drug and protein features, respectively, which were concatenated and then fed into a classification network for prediction. Li et al. [19] utilized drug molecular graphs and 2D distance maps to represent drugs and proteins. The 2D distance maps are taken as inputs to ResNet to compute protein features, and the protein features and drug molecular graphs are taken as inputs to the Interformer with interaction attention to compute interaction features. Wang et al. [20] proposed CSConv2d by improving DEEPScreen [21]. CSConv2d uses 2D images to represent compounds, and uses channel attention and spatial attention to improve its prediction performance. Torng et al. [22] proposed a GNN-based method for DTI prediction, which first pretrains a protein pocket graph autoencoder and utilizes the encoder module to extract protein features. It then employs a GCN to extract drug features, achieving impressive results in DTI prediction. Bai et al. [23] learned local interaction representations between drugs and targets through a bilinear attention network and took them as inputs to fully connected layers for DTI prediction, in which conditional domain adversarial learning was applied for better generalization.
Each modality has its own distinctive information different from other modalities. The fusion of these modalities can provide more comprehensive and richer information, which helps the model better understand and depict the features and relationships within the data, further enhancing the model's learning and generalization capabilities [24]. Therefore, multimodal data are also widely applied to DTI prediction to improve prediction accuracy. Wu et al. [25] combined SMILES with Morgan fingerprints to represent drugs, and protein sequences with k-mer sequences to represent proteins. Then, a CNN was used to extract drug and protein features as hidden states of nodes, and virtual nodes were added to bridge drugs and proteins to form an association graph for DTI prediction. Wang et al. [26] proposed MCPI, which extracts drug and protein features from a protein–protein interaction network, a compound–compound interaction network, Morgan fingerprints, a drug–molecule distance matrix and protein sequences; these features are fused into drug–protein features and fed into fully connected layers for DTI prediction. Hua et al. [27] proposed CPInformer for DTI prediction, in which the graph features extracted by GCN and the FCFP features obtained by fully connected layers are fused into compound features. Additionally, multi-scale protein features are extracted from protein sequences by multi-scale convolutions. Finally, these two groups of features are fused by the ProbSparse attention mechanism [28] and sequentially fed into convolutional layers and fully connected layers for predicting DTIs.
Although multimodal-based DTI prediction has made significant progress, existing models still face challenges in effectively fusing multimodal data. Most DTI prediction models ignore intra-modal interactions when modeling inter-modal interactions, thus failing to model inter- and intra-modal interactions simultaneously. In this study, to overcome these drawbacks, we present a hierarchical multimodal self-attention-based GNN for DTI prediction, referred to as HMSA-DTI.
Each representation of drugs and proteins has its own distinctive information. For different representations of drugs and proteins, different feature extraction methods are employed to compute their features, which are then combined to improve their feature quality. Since graph-level features have an adverse effect on multimodal feature fusion, the readout phase is removed from the GNN and node-level features are kept for the multimodal feature fusion. In order to fuse the multimodal features, we propose a hierarchical multimodal self-attention to extract more fine-grained interaction features across the multiple modalities of drugs and proteins. In ablation experiments, we demonstrate the benefits of our proposed hierarchical multimodal self-attention in multimodal feature fusion. Furthermore, experimental results show that our proposed HMSA-DTI outperforms several state-of-the-art methods for DTI prediction.

Overview of HMSA-DTI
We present a hierarchical multimodal self-attention GNN, abbreviated as HMSA-DTI, as illustrated in Fig. 1. The overall framework of HMSA-DTI is composed of four components: a protein feature extraction component, a drug feature extraction component, a multimodal feature fusion component and a classifier component. In the protein feature extraction component, protein features are extracted using a CNN block with protein sequences and 2-mer sequences as inputs. In the drug feature extraction component, drug SMILES and molecular graphs are utilized as inputs, and a CNN and a GNN are used to extract drug features. To capture and exploit the complex interactions between proteins and drugs and enhance their feature representation capability, we present a hierarchical multimodal self-attention mechanism, with which the four features are fused and then concatenated as inputs to the classifier component for the identification of DTIs.

Feature extraction for proteins
In the protein feature extraction module, we extract both global and local structural features of proteins to better characterize their properties. The global structural features of proteins are extracted from their sequences by a convolution block, while the local structural features are extracted from their k-mer sequences [29].

Feature extraction from protein sequences
In the same manner as Rao et al. [30], a protein sequence is encoded into a numerical vector, and zero padding is applied to obtain a numerical vector of equal length, $x_{token}$. An embedding layer is then utilized to map the numerical vector into the embedding space, and the resulting embedding matrix is fed into a CNN block to extract the protein sequence feature:

$$X_{seq} = \mathrm{CNN}_{seq}(\mathrm{Emb}(x_{token})), \qquad (1)$$

where $X_{seq} \in \mathbb{R}^{L_{seq} \times d_{protein}}$, $L_{seq}$ and $d_{protein}$ represent the spatial dimension and embedding dimension of the protein sequence feature, respectively, and $\mathrm{Emb}$ is an embedding layer.
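As an illustration, the encode-embed-convolve pipeline above can be sketched as follows. The amino-acid vocabulary, the dimensions and the single-convolution stand-in for the CNN block are assumptions for the sketch, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
AA = "ACDEFGHIKLMNPQRSTVWY"                    # 20 standard amino acids
AA2IDX = {a: i + 1 for i, a in enumerate(AA)}  # 0 is reserved for padding

def encode_protein(seq, max_len=1000):
    """Encode a protein sequence as integers and zero-pad to max_len."""
    ids = [AA2IDX.get(a, 0) for a in seq][:max_len]
    return np.array(ids + [0] * (max_len - len(ids)))

def embed_and_conv(x_token, d_emb=8, d_out=4, kernel=3):
    """Embedding lookup followed by one 1D convolution and ReLU
    (a minimal stand-in for the paper's CNN block)."""
    vocab = int(x_token.max()) + 1
    E = rng.standard_normal((vocab, d_emb)) * 0.1        # embedding table
    W = rng.standard_normal((kernel, d_emb, d_out)) * 0.1
    X = E[x_token]                                       # (max_len, d_emb)
    L_out = len(x_token) - kernel + 1
    out = np.stack([np.einsum("kd,kdo->o", X[i:i + kernel], W)
                    for i in range(L_out)])
    return np.maximum(out, 0.0)                          # (L_out, d_out)
```

Note that the output keeps one feature vector per sequence position, matching the spatial-dimension-by-embedding-dimension shape of the protein sequence feature.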

Feature extraction from 2-mer protein sequences
The sliding window algorithm is used to obtain k-mer protein sequences. Specifically, a window of length k is created and slides from left to right, shifting one amino acid at a time. If the length of the protein sequence is N, we obtain N − k + 1 k-mers. In this paper, k is set to 2, yielding a 2-mer protein sequence $x_{mer}$, from which the local protein feature is computed:

$$X_{mer} = \mathrm{CNN}_{mer}(\mathrm{Enc}(x_{mer})), \qquad (2)$$

where $X_{mer} \in \mathbb{R}^{L_{mer} \times d_{protein}}$, $L_{mer}$ is the spatial dimension of the 2-mer sequence feature, and $\mathrm{Enc}$ is the numerical vector encoder in HyperAttentionDTI [31]. This 2-mer sequence feature serves as the local feature representation of the protein.
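The sliding-window extraction described above can be written in a few lines; a minimal sketch:

```python
def kmer_sequence(seq, k=2):
    # slide a window of length k one residue at a time:
    # a sequence of length N yields N - k + 1 k-mers
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]
```

For example, with k = 2 the sequence "MKTAY" yields ["MK", "KT", "TA", "AY"], i.e. 5 − 2 + 1 = 4 2-mers.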

Feature extraction for drugs
Drugs are typically represented as SMILES or molecular graphs.SMILES is widely employed in computational chemistry, chemoinformatics and bioinformatics due to its compact storage and easy recognition.Molecular graphs can capture the local and global context of atoms and bonds within molecules and powerful GNNs can extract features from them.In this paper, both SMILES and molecular graphs are used as drug representations to enhance feature robustness.

Features extraction from SMILES
When the SMILES of a drug molecule is taken as input to a neural network, it needs to be converted into a digital vector. In this paper, the characters in SMILES are converted to integers from 1 to 64 to form a digital vector, which is then padded with 0s to produce a fixed-length vector $x_{smiles}$ and mapped into an embedding space. Finally, the drug SMILES feature $X_{smiles}$ is obtained using a CNN block $\mathrm{CNN}_{smiles}$:

$$X_{smiles} = \mathrm{CNN}_{smiles}(\mathrm{Emb}(x_{smiles})), \qquad (3)$$

where $X_{smiles} \in \mathbb{R}^{L_s \times d_{drug}}$, $L_s$ is the spatial dimension and $d_{drug}$ is the embedding dimension.
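The character-to-integer encoding and zero padding can be sketched as follows. Since the paper's exact mapping of characters to the integers 1–64 is not given, the vocabulary here is built from the data, which is an assumption of the sketch.

```python
def build_smiles_vocab(smiles_list):
    """Map each distinct SMILES character to an integer starting at 1
    (0 is reserved for padding); the paper uses a fixed 64-character
    mapping whose exact character set is not specified."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode_smiles(smiles, vocab, max_len=100):
    """Convert a SMILES string to a fixed-length, zero-padded digital vector."""
    ids = [vocab.get(ch, 0) for ch in smiles][:max_len]
    return ids + [0] * (max_len - len(ids))
```

The resulting vector is what would be fed into the embedding layer and CNN block.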

Feature extraction from molecular graph
A drug molecular graph G = (A, B) is composed of a set of atoms acting as its nodes, represented by A, and a set of chemical bonds serving as its edges, denoted by B. In this study, the initial feature $x_v$ of node v consists of eight properties: atomic number, degree (number of chemical bonds), formal charge, chirality, number of hydrogen atoms attached to the atom, hybridization, aromaticity and atomic mass. The initial feature of edge $e_{vw}$ includes bond type, bond position, conjugation and stereochemistry. A directed message passing neural network (D-MPNN) [32], composed of a message passing stage and a readout stage, is then used to extract molecular graph features. In contrast to the traditional MPNN, the D-MPNN uses not only the features of the nodes but also the directionality of edges in the message passing updates.
For any node v and an edge $B_{vw}$ associated with it, let $h_v^t$ and $h_{vw}^t$ denote their hidden states at the t-th layer, respectively. The initial feature $x_v$ of node v is used as the initial hidden state $h_v^0$, while the initial hidden state $h_{vw}^0$ of edge $B_{vw}$ is given by

$$h_{vw}^{0} = \alpha\left(W_b \cdot \mathrm{concat}(x_v, e_{vw})\right), \qquad (4)$$

where $W_b$ is a parameter matrix, $\alpha$ denotes the ReLU activation function and concat represents a concatenation layer. The message passing stage first aggregates the feature $x_v$ of node v, the feature $x_k$ of its neighbor node k and the hidden state $h_{kv}^t$ of edge $B_{kv}$ to obtain the message $m_{vw}^{t+1}$ of edge $B_{vw}$ at the (t+1)-th layer:

$$m_{vw}^{t+1} = \sum_{k \in N(v) \setminus \{w\}} M\left(x_v, x_k, h_{kv}^{t}\right), \qquad (5)$$

where M denotes an average function and N(v) represents the set of neighbor nodes of node v. Then, the message $m_{vw}^{t+1}$ and the hidden state $h_{vw}^t$ are utilized to calculate the hidden state $h_{vw}^{t+1}$ of edge $B_{vw}$ at the (t+1)-th layer:

$$h_{vw}^{t+1} = f\left(h_{vw}^{t}, m_{vw}^{t+1}\right), \qquad (6)$$

where f is a fully connected layer. After T iterations, we obtain the final hidden states of all edges. Then, the hidden states of the edges connected to node v are aggregated to compute the message $m_v$ of node v, which is concatenated with its initial feature $x_v$ to compute its final hidden state $h_v$:

$$m_v = \sum_{k \in N(v)} h_{kv}^{T}, \qquad h_v = \alpha\left(W_\alpha \cdot \mathrm{concat}(x_v, m_v)\right), \qquad (7)$$

where $W_\alpha$ is a parameter matrix.
In general, at the readout stage, average pooling or max pooling is used to pool the node features into a single vector, that is, a graph-level feature. Since drug properties are closely related to their local structures, the pooling operation at the readout stage causes the loss of node-level features, resulting in a reduction in the accuracy of DTI prediction [33]. Previous studies usually fused graph-level features of drug molecular graphs with other modal features. However, multimodal fusion using graph-level features cannot capture both inter- and intra-modal interactions simultaneously. Therefore, in this paper, node-level features $X_{graph} \in \mathbb{R}^{L_g \times d_{drug}}$ rather than graph-level features are used for DTI prediction, where $L_g$ is the number of atoms in a drug molecule and $d_{drug}$ is its embedding dimension. Finally, our proposed hierarchical multimodal self-attention is used to fuse the multimodal features of drugs and proteins to improve their feature representation ability.
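The directed message passing described above, with the readout removed so that node-level features are returned, can be sketched as follows. The weight shapes, the summation-based aggregation and the placement of the ReLU are illustrative simplifications of D-MPNN, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dmpnn_node_features(x, edges, edge_feats, d_hidden=8, T=2):
    """Minimal D-MPNN-style sketch: hidden states live on directed edges,
    and the readout/pooling step is omitted so node-level features are
    returned for the downstream multimodal fusion."""
    n, d_node = x.shape
    d_edge = edge_feats.shape[1]
    W_b = rng.standard_normal((d_node + d_edge, d_hidden)) * 0.1
    W_m = rng.standard_normal((d_hidden, d_hidden)) * 0.1
    W_a = rng.standard_normal((d_node + d_hidden, d_hidden)) * 0.1
    relu = lambda z: np.maximum(z, 0.0)

    # initial hidden state of each directed edge (v, w): relu(W_b . [x_v ; e_vw])
    h = {(v, w): relu(np.concatenate([x[v], edge_feats[i]]) @ W_b)
         for i, (v, w) in enumerate(edges)}
    incoming = {v: [(k, u) for (k, u) in h if u == v] for v in range(n)}

    for _ in range(T):
        new_h = {}
        for (v, w) in h:
            # aggregate states of edges entering v, excluding the reverse edge (w, v)
            msg = sum((h[(k, u)] for (k, u) in incoming[v] if k != w),
                      np.zeros(d_hidden))
            new_h[(v, w)] = relu(h[(v, w)] + msg @ W_m)
        h = new_h

    # node-level features: aggregate incoming edge states, concatenate with x_v
    out = np.zeros((n, d_hidden))
    for v in range(n):
        m_v = sum((h[e] for e in incoming[v]), np.zeros(d_hidden))
        out[v] = relu(np.concatenate([x[v], m_v]) @ W_a)
    return out
```

Because the function returns one row per atom rather than a pooled vector, its output can be handed directly to the multimodal fusion stage.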

Hierarchical multimodal self-attention
Multimodal learning is able to enhance feature representation by fusing features from multiple data sources to improve prediction performance [34]. In the DTI prediction task, prediction accuracy can be improved by combining features from drug SMILES, drug molecular graphs, protein sequences and protein 2-mer sequences. The simplest multimodal feature fusion is to concatenate features from different modalities without considering inter-modal relationships; as a result, the fused features lose semantic information and inter-modal interaction information. Many research works have used cross-attention to fuse multimodal features. As far as drug representations are concerned, SMILES describes the composition of drug molecules in a linear manner, while a molecular graph describes the topological relationships between atoms using a graph structure. The fusion of these two representations can enhance the model's capacity to characterize drug molecules and allows more robust drug features to be extracted. The process for the fusion of drug features using cross-attention is shown in Fig. 2. A linear transformation is applied to the drug molecular graph feature $X_{graph} \in \mathbb{R}^{L_g \times d_{drug}}$ to compute the Query matrix $Q_{graph} \in \mathbb{R}^{L_g \times d_{att}}$, and to the drug SMILES feature $X_{smiles} \in \mathbb{R}^{L_s \times d_{drug}}$ to compute the Key matrix $K_{smiles} \in \mathbb{R}^{L_s \times d_{att}}$ and the Value matrix $V_{smiles} \in \mathbb{R}^{L_s \times d_{att}}$:

$$Q_{graph} = X_{graph} W_Q, \quad K_{smiles} = X_{smiles} W_K, \quad V_{smiles} = X_{smiles} W_V. \qquad (8)$$

The attention matrix $A$ is computed as

$$A = \mathrm{softmax}\!\left(\frac{Q_{graph} K_{smiles}^{\top}}{\sqrt{d_k}}\right), \qquad (9)$$

where $d_k = d_{att}$. Then, the attention matrix $A$ and the value matrix $V_{smiles}$ are used to fuse the SMILES features and the graph features:

$$h_i = \sum_{j=1}^{L_s} a_{ij} v_j, \qquad (10)$$

where $a_{ij}$ is an element of the attention matrix $A$ and $v_j$ represents the $j$-th row of the value matrix $V_{smiles}$. Eq. (10) shows that the fusion feature $H$ generated by cross-attention is just a weighted combination of the row vectors of the value matrix $V_{smiles}$, which is a linear transform of the drug SMILES feature alone, so the fused feature falls short of the drug molecular graph feature. This fusion method does not take full advantage of multimodal features.

In this study, we present a hierarchical multimodal self-attention mechanism to fuse multimodal features from drugs and proteins. This mechanism can capture and exploit the complex interactions between proteins and drugs, taking into account both intra- and inter-modal interactions. As illustrated in Fig. 3, this attention mechanism includes two levels. At the first level, the drug SMILES feature and the molecular graph feature are fused to compute the first-level drug fusion feature $H_{drug}^{1}$; similarly, the protein sequence feature and the protein 2-mer feature are fused to obtain the first-level protein fusion feature $H_{protein}^{1}$. We first concatenate the SMILES feature $X_{smiles}$ with the graph feature $X_{graph}$ to obtain the combined drug feature $H_{drug}$, from which the Query, Key and Value matrices are computed by linear transformations:

$$Q_{drug} = H_{drug} W_Q, \quad K_{drug} = H_{drug} W_K, \quad V_{drug} = H_{drug} W_V. \qquad (11)$$

As shown in Eq. (11), $Q_{drug}$, $K_{drug}$ and $V_{drug}$ include both the SMILES features and the molecular graph features of the drug. Then the Query matrix $Q_{drug}$ and the Key matrix $K_{drug}$ are used to compute the attention matrix $A_{drug}$:

$$A_{drug} = \mathrm{softmax}\!\left(\frac{Q_{drug} K_{drug}^{\top}}{\sqrt{d_k}}\right). \qquad (12)$$

Eq. (12) shows that the drug attention matrix $A_{drug}$ includes correlations between SMILES and SMILES, between SMILES and molecular graphs, between molecular graphs and SMILES, and between molecular graphs and molecular graphs. Then the drug fusion feature $H_{drug}^{1}$ is given by

$$H_{drug}^{1} = A_{drug} V_{drug} \approx \left(Q_{drug} K_{drug}^{\top}\right) V_{drug}, \qquad (13)$$

where, in the right-hand expansion, the denominator $\sqrt{d_k}$ and the softmax function are omitted to simplify the expression. It can be seen in Eq. (13) that the first-level drug fusion feature $H_{drug}^{1}$ includes SMILES features as well as molecular graph features. Similarly, the same feature fusion process is applied to the protein sequence features and 2-mer sequence features to compute the first-level protein fusion feature $H_{protein}^{1}$. At the second level, $H_{protein}^{1}$ and $H_{drug}^{1}$ are processed in a manner similar to the first level to compute the final drug–target feature $H$. The drug–target feature $H$ is split into the drug SMILES feature $D_{smiles}$, the drug graph feature $D_{graph}$, the protein sequence feature $D_{seq}$ and the protein 2-mer sequence feature $D_{mer}$, which are added to their initial features, that is, the inputs to the hierarchical multimodal self-attention. Lastly, a pooling layer is employed to compute the drug SMILES feature vector $Z_{smiles}$, the drug molecular graph feature vector $Z_{graph}$, the protein sequence feature vector $Z_{seq}$ and the protein 2-mer sequence feature vector $Z_{mer}$ as the outputs of the multimodal feature fusion module.
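The two-level fusion can be sketched as follows: each level concatenates its inputs along the token axis and applies one self-attention layer, so the attention matrix covers every intra- and inter-modal token pair. The dimensions, the single attention head and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attn_fuse(feats, d_att=16):
    """Concatenate modality features along the token axis and apply one
    self-attention layer (single head, random weights for illustration)."""
    X = np.concatenate(feats, axis=0)                  # (L1 + L2, d)
    d = X.shape[1]
    W_q, W_k, W_v = (rng.standard_normal((d, d_att)) * 0.1 for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(d_att))              # row-stochastic attention
    return A @ V                                       # fused feature

# first level: fuse the two drug modalities and the two protein modalities
X_smiles, X_graph = rng.standard_normal((30, 32)), rng.standard_normal((20, 32))
X_seq, X_mer = rng.standard_normal((50, 32)), rng.standard_normal((49, 32))
H_drug = self_attn_fuse([X_smiles, X_graph])           # (50, 16)
H_protein = self_attn_fuse([X_seq, X_mer])             # (99, 16)
# second level: fuse the two first-level fusion features into H
H = self_attn_fuse([H_drug, H_protein])                # (149, 16)
```

Because the token axis is the concatenation of both modalities, the attention matrix at each level contains the four blocks of intra- and inter-modal correlations discussed above.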

Classifier and loss function
In the classifier component, we employ four fully connected layers, each followed by a dropout layer and a LeakyReLU activation function. Binary cross-entropy loss is used for training:

$$\mathcal{L} = -\sum_{i} \left[ t_i \log(p_i) + (1 - t_i)\log(1 - p_i) \right], \qquad (14)$$

where $t_i$ is the sample label (1 for a positive sample and 0 for a negative sample) and $p_i$ is the predicted probability.
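The binary cross-entropy loss above can be computed directly; a minimal sketch, where the clipping constant is an implementation detail added for numerical stability rather than part of the paper's formulation:

```python
import numpy as np

def bce_loss(p, t):
    """Binary cross-entropy summed over samples; t in {0, 1}, p in (0, 1)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)  # guard against log(0)
    return float(-np.sum(t * np.log(p) + (1 - t) * np.log(1 - p)))
```

A maximally uncertain prediction p = 0.5 contributes log 2 ≈ 0.693 per sample, while a confident correct prediction contributes almost nothing.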

Benchmark datasets
Our HMSA-DTI model and four baseline models were assessed on five widely used benchmark datasets: DrugBank, Human, C. elegans, BioSNAP and Davis. The Human and C. elegans datasets are balanced datasets, which were created by Liu [35]. The DrugBank dataset is also a balanced dataset, which was created by HyperAttentionDTI [31]. The BioSNAP dataset is sourced from the Stanford Biomedical Network Dataset [36]; based on it, we constructed an imbalanced dataset with a positive-to-negative sample ratio of 1:10. The Davis dataset [37] is an unbalanced dataset, including 68 drugs and 379 proteins. A detailed description of these datasets is provided in Table 1.

Experimental analysis
In this paper, we used AUC, ACC (accuracy), precision and recall to evaluate model performance on the balanced datasets, and AUC and AUPR to evaluate model performance on the unbalanced datasets. We compared our proposed HMSA-DTI model with four state-of-the-art methods: HyperAttentionDTI [31], GIFDTI [38], CoaDTI [39] and MHSADTI [40]. For a fair comparison, all models were evaluated using 10-fold cross-validation on a Linux server equipped with a GeForce RTX 3090 GPU. The hyperparameters of the comparison models were set in accordance with the corresponding literature. During the experimental process, 10% of the data were selected as the test dataset, while the remaining data were divided into training and validation datasets. The test dataset is fixed across experiments, and its data do not appear in the training or validation datasets of any cross-validation experiment. During training, the number of epochs and the batch size were set to 150 and 50, respectively, with the best-performing model on the validation dataset saved for testing on the test dataset. The mean of the evaluation metrics over the 10-fold cross-validation results was reported.

Performance comparison on the DrugBank dataset
This section presents a comparative analysis of the proposed HMSA-DTI model and four baseline models on the DrugBank dataset. The experimental results, presented in Table 2, clearly demonstrate the outstanding performance of our HMSA-DTI model with respect to four evaluation metrics: AUC, precision, recall and ACC. Compared with the top-performing baseline GIFDTI, HMSA-DTI achieved improvements of 1.38% in AUC, 2.14% in precision, 0.41% in recall and 1.84% in ACC.
Two main factors contribute to this improvement. Firstly, our HMSA-DTI model removes the readout phase of the D-MPNN, allowing node-level features to be retained rather than graph-level features, which is beneficial to a thorough fusion of the multimodal data from both drugs and proteins using our hierarchical multimodal self-attention. Secondly, the hierarchical multimodal self-attention based feature fusion approach allows the simultaneous extraction of inter- and intra-modal interactions, resulting in improved feature representation capabilities. To better illustrate the differences in model performance, we provide the PR and ROC curves of the evaluated models on the DrugBank dataset, as displayed in Fig. 4. The HMSA-DTI model consistently outperforms the baseline models in prediction performance, showing its superiority and effectiveness in DTI prediction.

Performance comparison on the Human dataset
This section provides a performance comparison of the HMSA-DTI model and four baseline models on the Human dataset. As illustrated in Table 3, our HMSA-DTI model shows superior performance on four metrics: AUC, precision, ACC and recall. Against the best-performing baseline model, HyperAttentionDTI, HMSA-DTI achieves improvements of 1.32% in AUC, 0.99% in precision and 1.26% in ACC. Furthermore, HMSA-DTI shows an improvement of 1.33% in recall over the CoaDTI model, which ranks first on this metric among all baseline models. These results show that our HMSA-DTI model has excellent predictive performance on the Human dataset.

Performance comparison on the C.elegans dataset
A comparison between the HMSA-DTI model and four baseline models was made on the C. elegans dataset. As shown in Table 4, compared with the best-performing baseline model on this dataset, HyperAttentionDTI, HMSA-DTI achieved improvements of 0.18%, 0.19% and 0.46% in terms of AUC, ACC and recall, respectively. The experimental results on the three balanced datasets consistently demonstrate the effectiveness of our HMSA-DTI model.

Performance comparison on the Davis dataset
In this section, we compared the performance of the HMSA-DTI model with four baseline models on the Davis dataset. Since the Davis dataset is unbalanced, we used AUC and AUPR to evaluate these models. As shown in Table 5, compared with the best-performing baseline model HyperAttentionDTI, our HMSA-DTI achieved improvements of 0.2% and 1.12% in terms of AUC and AUPR, respectively. To better illustrate the differences in model performance, the PR and ROC curves on the Davis dataset are displayed in Fig. 5.

Performance comparison on an independent dataset
Although our HMSA-DTI model outperformed the baseline models on the benchmark datasets, to further verify the generalization ability of the models, we trained our model on the DrugBank dataset and tested it on the BioSNAP dataset. During training, 80% of the DrugBank dataset was used as the training dataset, while the remaining 20% were used as the validation dataset.
All data from the BioSNAP dataset served as the test dataset. The experimental results in Table 6 show that HMSA-DTI outperformed all baselines by at least 3% on both the AUC and AUPR metrics.

Ablation experiments
A set of ablation experiments was carried out on the DrugBank dataset to test the effectiveness of the various modules within our model, and the results are shown in Table 7. We first replaced the hierarchical multimodal self-attention based feature fusion method in HMSA-DTI with a concatenation-based feature fusion method to obtain the HMSA-DTI-Concat model. We found that HMSA-DTI, which includes the hierarchical multimodal self-attention, outperforms the HMSA-DTI-Concat model on all four evaluation metrics: by 0.61% in AUC, 0.72% in ACC, 0.76% in precision and 0.64% in recall. This shows that our hierarchical multimodal self-attention mechanism is beneficial for fusing multimodal features and can improve the feature representation ability of drugs and proteins. In addition, we used two attention variants, Non-Local [41] and Cross-Attention [42], to construct HMSA-DTI-NL and HMSA-DTI-CA, respectively. It is found that HMSA-DTI outperforms both of these attention variants as well.

K-value analysis
When proteins are represented by k-mer sequences, the choice of the K-value impacts model performance. In general, a smaller value of K makes the model focus more on the local structure of proteins, whereas a larger value of K makes the model put more emphasis on the global structure of proteins. When K = 1, 2, 3, the corresponding vocabulary sizes are 22, $22^2 = 484$ and $22^3 = 10\,648$, respectively. When K = 4, the vocabulary size is $22^4 = 234\,256$, yielding an exponential increase in model complexity. Therefore, this study only evaluated the performance for K = 1, 2 and 3. During the K-value analysis, we kept all other parameters fixed, including the number of epochs, learning rate, batch size, etc. We varied the K-value and determined the optimal K-value according to DTI prediction performance under cross-validation. As shown in Fig. 6, the model with K = 2 performed best on the ACC and AUC metrics. Therefore, 2-mer sequences are selected to represent proteins.
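The vocabulary growth quoted above is just the alphabet size raised to the k-th power, which makes the exponential blow-up at K = 4 easy to verify:

```python
def kmer_vocab_size(k, alphabet_size=22):
    # number of distinct k-mers over an alphabet of the given size
    return alphabet_size ** k

sizes = {k: kmer_vocab_size(k) for k in (1, 2, 3, 4)}
# {1: 22, 2: 484, 3: 10648, 4: 234256}
```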

Over-smoothing analysis
During the training process, increasing the number of layers of the D-MPNN component in HMSA-DTI causes the hidden states to converge to similar values, producing over-smoothing and thereby decreasing the feature representation ability [43]. Therefore, we performed a range of experiments to assess the influence of the number of D-MPNN layers on the prediction performance of our HMSA-DTI model. During the over-smoothing analysis, the optimal K-value was selected and all the remaining parameters were fixed except for the number of D-MPNN layers. We varied the number of D-MPNN layers and selected the optimal number of layers based on DTI prediction performance under cross-validation. As shown in Fig. 7, the HMSA-DTI with a two-layer D-MPNN performed best in terms of the ACC and AUC metrics. Therefore, we used a D-MPNN with two layers in all experiments.

Case study
In this section, we randomly selected a drug and a protein as test candidates. These selected test candidates were then removed from the DrugBank dataset, and the rest were used to construct the training dataset. This validation approach allows us to assess the robustness and generalizability of the HMSA-DTI model and provides insights into its performance in real-world scenarios. In this experiment, we randomly chose the drug Estradiol acetate (DrugBank ID: DB13952) and the target protein gamma-aminobutyric acid receptor subunit rho-3 (UniProt ID: A8MPY1) as the test candidates. We selected 10 proteins from the DrugBank dataset that interact with Estradiol acetate and used them to create 10 positive samples. For the negative samples, we combined Estradiol acetate with each of the remaining 4244 proteins in the dataset to generate 4244 negative samples. The 10 positive samples and 4244 negative samples comprise the test dataset.
We then ranked the candidate target proteins by their predicted interaction scores and list the top 20 target proteins in Table 8. HMSA-DTI successfully predicted all positive samples, five of which ranked among the top 20. Similarly, for the target protein gamma-aminobutyric acid receptor subunit rho-3, we followed the same procedure to create a test dataset consisting of 15 positive samples and 6630 negative samples. As shown in Table 9, HMSA-DTI also accurately predicted all positive samples, three of which are among the top 20. These results show the effectiveness of HMSA-DTI in predicting DTIs while demonstrating its robustness and generalizability.

Conclusion
In this study, we present a hierarchical multimodal self-attention-based GNN for DTI prediction, namely HMSA-DTI. HMSA-DTI utilizes multimodal data to improve its feature representation ability and employs a hierarchical multimodal self-attention feature fusion approach to fuse node-level features of molecular graphs with SMILES sequence features, yielding more robust drug features. In addition, a protein sequence representation is combined with a protein 2-mer sequence representation, and their features are fused by the hierarchical multimodal self-attention to enhance protein representation ability. We validated our HMSA-DTI model on five benchmark datasets: DrugBank, Human, C. elegans, BioSNAP and Davis. Our HMSA-DTI model outperforms the baseline models, demonstrating its strong competitiveness and effectiveness.

Key Points
• We took multimodal data such as SMILES, drug molecular graphs, protein sequences and protein 2-mer sequences as inputs. Additionally, the readout phase is removed from the D-MPNN to obtain node-level features instead of graph-level features, which benefits the subsequent multimodal feature fusion.
• Hierarchical multimodal self-attention was proposed to improve the discriminability and robustness of features by computing both intra- and inter-modal interactions.
• Five benchmark datasets were used to validate our proposed HMSA-DTI. Compared with other baseline models, our proposed HMSA-DTI exhibits superior performance on multiple metrics and has strong competitiveness.

Figure 1. The overall architecture of HMSA-DTI. After mapping each of the four inputs to the embedding space, HMSA-DTI uses CNN and D-MPNN blocks to extract features, fuses the multimodal features with a hierarchical multimodal self-attention mechanism to capture DTIs, and finally the classifier outputs the predicted scores.

Figure 2. The diagram for cross-attention.

Figure 3. Hierarchical multimodal self-attention mechanism. The red dashed box and the blue dashed box represent the first-level and second-level attention modules, respectively. ⊕ denotes the concatenation operation and ⊗ denotes matrix multiplication.

Figure 4. ROC curve and PR curve of the HMSA-DTI model and baseline models on the DrugBank dataset.

Figure 5. ROC curve and PR curve of the HMSA-DTI model and baseline models on the Davis dataset.

Figure 6. The impact of the K-value on ACC and AUC.

Table 1. The detailed description of the benchmark datasets.

Table 2. The performance comparison between the HMSA-DTI model and the baseline models on the DrugBank dataset.

Table 3. The performance comparison between the HMSA-DTI model and the baseline models on the Human dataset.

Table 4. The performance comparison between the HMSA-DTI model and the baseline models on the C. elegans dataset.

Table 5. The performance comparison between the HMSA-DTI model and the baseline models on the Davis dataset.

Table 6. The performance comparison between the HMSA-DTI model and the baseline models on the independent dataset.