Genome-scale annotation of protein binding sites via language model and geometric deep learning

Revealing protein binding sites with other molecules, such as nucleic acids, peptides, or small ligands, sheds light on disease mechanism elucidation and novel drug design. With the explosive growth of proteins in sequence databases, how to accurately and efficiently identify these binding sites from sequences becomes essential. However, current methods mostly rely on expensive multiple sequence alignments or experimental protein structures, limiting their genome-scale applications. Besides, these methods haven’t fully explored the geometry of the protein structures. Here, we propose GPSite, a multi-task network for simultaneously predicting binding residues of DNA, RNA, peptide, protein, ATP, HEM, and metal ions on proteins. GPSite was trained on informative sequence embeddings and predicted structures from protein language models, while comprehensively extracting residual and relational geometric contexts in an end-to-end manner. Experiments demonstrate that GPSite substantially surpasses state-of-the-art sequence-based and structure-based approaches on various benchmark datasets, even when the structures are not well-predicted. The low computational cost of GPSite enables rapid genome-scale binding residue annotations for over 568,000 sequences, providing opportunities to unveil unexplored associations of binding sites with molecular functions, biological processes, and genetic variants. The GPSite webserver and annotation database can be freely accessed at https://bio-web1.nscc-gz.cn/app/GPSite.


Introduction
Proteins perform most biological functions by specifically interacting with other molecules such as nucleic acids, peptides, proteins, or small ligands of various kinds (Alipanahi et al., 2015;Rolland et al., 2014;Andreini et al., 2009).Knowledge of these binding interfaces benefits protein function prediction, disease mechanism understanding, and novel drug design (Lee et al., 2007;Wang et al., 2017;Wells and McClendon, 2007).Although experimental techniques such as X-ray crystallography and nuclear magnetic resonance spectroscopy can solve the native complex structures to detect binding sites, they are costly, time-consuming, and unsuitable for proteins with unknown binding partners.With the explosive growth of proteins in sequence databases (UniProt Consortium, 2023;Sayers et al., 2023), developing effective and efficient computational methods to recognize potential binding regions from sequences in a large-scale manner is imperative to fill the gap between genome and phenome.
A conventional way to predict binding interfaces is comparative modeling, which employs alignment algorithms to transfer known binding residues from similar templates to the query proteins (Zhang et al., 2010;Yang et al., 2013;Esmaielbeiki et al., 2016).Nevertheless, this strategy will be seriously restricted when no high-quality template exists.Therefore, methods based on machine learning and deep learning have prevailed in recent years, which can be divided into sequence-based and structure-based approaches according to their used protein information.Sequence-based methods leverage machine learning classifiers (e.g.support vector machine) to learn local binding characteristics from sequence contexts in a sliding window (Yan and Kurgan, 2017;Taherzadeh et al., 2016a;Zhao et al., 2018;Yu et al., 2013), or employ deep learning models like transformer Vaswani et al., 2017 to capture global dependencies (Wang et al., 2022;Yuan et al., 2022c).Despite requiring only readily available protein sequences, these predictors are of limited accuracy due to the lack of tertiary structure information.On the other hand, experimental structure-based approaches are often more effective.Earlier methods encode protein structures as 2D images (Xia et al., 2020) or 3D voxels (Jiménez et al., 2017), which are processed via grid-based convolutional neural networks.Current approaches tend to handle protein structures as graphs composed of surface point clouds (Gainza et al., 2020;Li and Liu, 2023), atoms (Tubiana et al., 2022;Krapp et al., 2023b) or residues (Xia et al., 2021;Yuan et al., 2021), and adopt geodesic convolution or graph neural networks (GNNs) to learn the binding-relevant spatial patterns.Unfortunately, the expressive capacities of most methods remain restricted, as the geometry of the structure is not yet fully explored.More importantly, both present sequence-and structure-based predictors are hampered for high-throughput practices at the genome scale for two reasons.Firstly, the dependency on evolutionary profiles from multiple sequence alignments (MSA) for most methods leads to high computational expense and occasionally subpar performance for shallow sequence alignments.Secondly, albeit powerful when native structures are available, structure-based methods will exhibit performance declines for unbound or predicted structures, probably owing to their sensitivity towards details and errors in the structures.
Our previous work (Yuan et al., 2022b) has validated the feasibility of exploiting predicted structures from AlphaFold2 (Jumper et al., 2021) for training to enhance the model's robustness, yet the computationally intensive structure prediction pipeline still restrains its application to novel sequences absent from the AlphaFold Protein Structure Database (Tunyasuvunakool et al., 2021).Since protein sequence can be regarded as a language in biology, unsupervised pre-training with language models has recently been applied to protein sequence representation learning and has displayed competitive or better results than manually engineered evolutionary features in different downstream tasks (Yuan et al., 2022c;Rives et al., 2021;Elnaggar et al., 2022;Unsal et al., 2022;Yuan et al., 2023c).Based on this, ESMFold (Lin et al., 2023) replaces evolutionary information from MSA with a large-scale pretrained protein language model to directly infer atomic-level protein structure from single sequence.This results in an order-of-magnitude acceleration of prediction while maintaining accuracy nearly as alignment-based methods including AlphaFold2.Therefore, it is promising to develop a fast and accurate model tailored for large-scale sequence-based binding site prediction based on the recent advances in protein representation learning and structure prediction with language models.
To facilitate protein structure modeling, geometric deep learning techniques have recently flourished in protein docking (Stärk et al., 2022), protein structure pre-training (Zhang et al., 2022b), protein design (Dauparas et al., 2022;Gao et al., 2022), and binding site prediction (Gainza et al., 2020;Li and Liu, 2023;Tubiana et al., 2022;Krapp et al., 2023b), since they can better manipulate data with no natural grid-like topology than 3D convolutional neural networks.Among these, current geometric binding site predictors are mostly built on surface point clouds (Gainza et al., 2020;Li and Liu, 2023) or atom graphs (Tubiana et al., 2022;Krapp et al., 2023b).However, although the binding interface is mainly comprised of surface atoms, the global structure of the protein in general influences how the interface is formulated and how the binding partner is interacted with, which should be modeled.Besides, the calculation of protein surfaces and mapping of their properties are usually time-consuming, while methods based on full atom graphs are memory-consuming and thus difficult to process long sequences.Consequently, designing a geometry-aware message passing mechanism on residue graphs to synergistically integrate sequence and structure information is potentially more suitable for the binding site prediction task.
In this study, we propose GPSite (Geometry-aware Protein binding Site predictor), a fast, accurate, and versatile network for concurrently predicting binding residues of 10 types of biologically relevant molecules including DNA, RNA, peptide, protein, ATP, HEM, and metal ions (Zn 2+ , Ca 2+ , Mg 2+ , Mn 2+ ) in a multi-task framework.GPSite was trained on informative sequence embeddings and predicted structures generated by pre-trained protein language models, and thus does not rely on MSA or native structures.To better capture the high-level bio-physicochemical characteristics in the predicted structures, a comprehensive geometric featurizer along with an edge-enhanced graph neural network is designed to extract the residual and relational geometric contexts in an end-to-end manner.Experiments demonstrate that GPSite substantially surpasses state-of-the-art sequencebased and structure-based approaches on various benchmark datasets, even under conditions where the predicted structures are of lower quality.GPSite runs fast enough to process genome-scale sequence databases such as the entire Swiss-Prot (UniProt Consortium, 2023), allowing for rapid binding residue annotations for over 568,000 sequences.Further analyses indicate that such annotations can promote discoveries in associations of binding sites with molecular functions, biological processes, and genetic variants.Besides the standalone code, we also provide the GPSite webserver and annotation database at https://bio-web1.nscc-gz.cn/app/GPSite.

Results
The geometry-aware protein binding site predictor (GPSite) GPSite is a geometry-aware versatile protein binding site predictor that can fast and accurately identify binding residues of DNA, RNA, peptide, protein, ATP, HEM, and metal ions (Zn 2+ , Ca 2+ , Mg 2+ , Mn 2+ ) from protein sequences.As shown in Figure 1 and detailed in Methods, an input protein sequence is fed to the pre-trained language model ProtTrans (Elnaggar et al., 2022) and the folding model ESMFold to generate informative sequence embedding and predicted structure, respectively.From the predicted structure, the coordinates of the N, C α , C, and O atoms as well as the centroid of the heavy sidechain atoms are gathered, and DSSP (Kabsch and Sander, 1983) is employed to calculate the relative solvent accessibility and secondary structure for each residue.Then, a protein radius graph is built where residues constitute the nodes and adjacent nodes are connected by edges.In addition to the residue features by ProtTrans and DSSP, an end-to-end geometric featurizer is designed to construct a local coordinate system for each residue and extract geometric features capturing the arrangements of backbone and sidechain atoms in or between residues.Specifically, the geometric node features consist of intra-residue distances between two atoms (including the sidechain centroid), relative directions of other inner atoms or sidechain centroid to C α , as well as the bond and torsion angles.Similarly, the geometric edge features comprise inter-residue distances between atoms from the two neighboring residues, relative directions of all atoms in the adjacent residue to C α of the central residue, and spatial orientation between the two reference frames of the neighboring nodes.
To learn the residue representations by considering multi-scale interactions in node, edge, and global context levels, we design a GNN with message passing, edge update and global node update acting on this geometric-aware attributed graph.The message passing layer adopts the multi-head attention in transformer enhanced by edge features to aggregate information from the local neighborhood and update the central node.Then the features of an edge are updated using its connecting nodes.Finally, a gated attention is applied to update the node states using the global context information.Benefiting from the above pipeline, GPSite is invariant to rotation and translation.Besides, GPSite leverages the multi-task learning strategy, where the GNN is shared among different ligands to capture the common binding-relevant characteristics, followed by 10 ligand-specific multilayer perceptrons (MLPs) to mine the binding patterns of particular molecules.This framework also reduces the time for inference of multiple ligands by the concurrent prediction fashion.In conclusion, GPSite is distinguished from the previous approaches in four key aspects.First, profiting from the effectiveness and low computational cost of ProtTrans and ESMFold, GPSite is liberated from the reliance on MSA and native structures, thus enabling genome-wide binding site prediction.Second, unlike methods that only explore the C α models of proteins (Xia et al., 2021;Ingraham et al., 2019), GPSite exploits a comprehensive geometric featurizer to fully refine knowledge in the backbone and sidechain atoms.Third, the employed message propagation on residue graphs is global structure-aware and time-efficient compared to the methods based on surface point clouds (Gainza et al., 2020;Li and Liu, 2023), and memory-efficient unlike methods based on full atom graphs (Tubiana et al., 2022;Krapp et al., 2023b).Residue-based message passing is also less sensitive towards errors in the predicted structures.Last but not least, instead of predicting binding sites for a single molecule type or learning binding patterns separately for different molecules, GPSite applies multi-task learning to better model the latent relationships among different binding partners.The overview of GPSite.The protein sequence is input to the pre-trained language model ProtTrans and the folding model ESMFold to generate the sequence embedding and predicted structure, respectively.According to the structure, a protein radius graph is constructed where residues constitute the nodes and adjacent nodes are connected by edges.In addition to the pre-computed residue features of ProtTrans embedding and DSSP structural properties, a comprehensive, end-to-end geometric featurizer is employed to extract the geometric node features including distance, direction and angle, as well as geometric edge features between residues including distance, direction and orientation.Here, the R group denotes the centroid of the heavy sidechain atoms.The resulting geometric-aware attributed graph is input to a shared GNN to perform edgeenhanced message passing for capturing the common binding-relevant characteristics among different molecules.Finally, 10 ligand-specific MLPs are adopted to learn the binding patterns of particular molecules in a multi-task manner.Examples of the applications of GPSite include binding site identification and protein-level Gene Ontology (GO; Ashburner et al., 2000) function prediction.

GPSite outperforms state-of-the-art methods
We collected 10 binding site benchmark datasets for the 10 ligands from Protein Data Bank (PDB; Berman et al., 2000) as detailed in Methods, which were combined to train and evaluate GPSite using the five-fold cross-validation and independent test sets.As shown in Appendix 2-table 2, GPSite obtains average area under the receiver operating characteristic curve (AUC) values over the 10 ligand types of 0.918 and 0.921; as well as average area under the precision-recall curve (AUPR) values of 0.603 and 0.594 on the cross-validation and independent tests, respectively.The consistent performance on the validation and test sets indicates the robustness of our model.In Figure 2A, the receiver operating characteristic (ROC) curves and the precision-recall curves on the 10 test sets are plotted to overview the performance of GPSite for different ligands.
Although trained on predicted protein structures, GPSite can also adopt native structures as input for prediction whenever applicable.By doing this, extra performance boosts can be gained with average AUPR increase of 7.8% (Appendix 3-figure 2).However, experimental structures are not always available in real-world scenarios, such as genome-scale sequence databases.To this end, for the best experimental structure-based method (measured by AUPR) in each test set, we also investigated the impact on performance when using ESMFold-predicted structures as input.As expected, the performance of these methods mostly decreases substantially utilizing predicted structures for testing, because they were trained with high-quality native structures.For example, the AUPR of GraphBind for predicting RNA-binding sites decreases from 0.506 to 0.433, compared to the AUPR of 0.573 by GPSite.Similarly, the AUPR of ScanNet drops from 0.476 to 0.399, compared to the AUPR of 0.484 by GPSite for predicting protein-binding sites.Therefore, in the practical situations where experimental structures are unavailable, the superiority of our method will be further reflected.

GPSite is robust for low-quality predicted structures
Since GPSite is built on ESMFold, it is necessary to examine the quality of the predicted structures and its impact on the model performance.Figure 3B and Appendix 3-figure 3 show the distributions of TM-scores between native and predicted structures calculated by US-align (Zhang et al., 2022a) in the 10 benchmark datasets, where most proteins are accurately predicted with TM-score >0.7 (see also Appendix 2-table 5).Overall, ESMFold achieves median TM-scores of 0.89, 0.76, 0.93, 0.93, 0.94, 0.94, 0.93, 0.94, 0.95, and 0.96 for the DNA, RNA, peptide, protein, ATP, HEM, Zn 2+ , Ca 2+ , Mg 2+ , and Mn 2+ datasets, respectively (Appendix 2-table 6).We next explored whether GPSite can maintain its performance on low-quality predicted structures.Figure 3A presents the performance of GPSite on ESMFold-predicted structures with TM-score >0.7 or ≤0.7, and the comparisons with the leading structure-based methods in the test sets of DNA, RNA, and peptide.Reasonably, compared to the well-predicted proteins, the performance of GPSite is inferior on the subsets of proteins with TM-score ≤0.7.Nevertheless, GPSite continues to outshine the most advanced structure-based methods input with ESMFold-predicted structures or even experimental structures.Similar trends are also observed for the rest of the ligands in Appendix 3-figure 4. Given the infrequency of low-quality predicted structures except for the RNA test set, we took a closer inspection of the 104 proteins with predicted structures of TM-score <0.5 in the RNA test set.In this subset, GraphBind achieves AUPR values of 0.455 and 0.376 using native and predicted structures, respectively, compared to the AUPR of 0.516 by GPSite.As shown in Figure 3C with lines fit to the per-protein TM-score and AUPR using linear regression, GPSite consistently outperforms GraphBind using predicted structures regardless of the prediction quality of ESMFold, and is only surpassed by GraphBind input with native structures on proteins of extremely low quality (TM-score <0.3).An example is presented in Case study for the ribosome biogenesis protein ERB1 for illustration.To sum up, ESMFold could produce high-quality structures in most cases, and even for the low-quality predicted structures, GPSite is robust enough to generate reliable predictions better than current state-of-the-art structure-based methods.
We finally illustrate a potential reason for the robustness of GPSite by an example from the test set where GPSite is able to discern among various interfaces even though the structure is not perfectly predicted.Figure 3D shows the structure of the human glucocorticoid receptor (GR), a transcription factor that binds DNA and assembles a coactivator peptide to regulate gene transcription (PDB: 7PRW, chain A).The DNA-binding domain of GR also consists of two C4-type zinc fingers to bind Zn 2+ ions.Although the structure of this protein is not perfectly predicted (TM-score=0.72),the local structures of the binding domains of peptide and DNA are actually predicted accurately as viewed by the superpositions of the native and predicted structures in Figure 3D and E. Therefore, GPSite can correctly predict all Zn 2+ binding sites and precisely identify the binding sites of DNA and peptide with AUPR values of 0.949 and 0.924, respectively (Figure 3F, G and H).

The effects of protein features and model designs
To reveal the roles of distinct protein features and model designs in GPSite, we conducted comprehensive ablation studies.As shown in Figure 4A and Appendix 2-table 7, when removing the Prot-Trans embeddings from GPSite, the model yields inadequate performance (average AUPR of 0.516 among the 10 test sets) due to the complete neglect of the sequence information in proteins.The introduction of one-hot sequence encodings or MSA profile (elaborated in Generation of the evolutionary features from MSA) partially restores the performance to average AUPR of 0.545 or 0.568, respectively.Nevertheless, the utilization of language model representations in GPSite attains the highest average AUPR of 0.594.To further understand the advantages of ProtTrans over the evolutionary features from MSA, we compared their performance against the number of effective homologous sequences (Neff) of the proteins from the combined test set of the 10 ligands.Neff is an HHblits (Remmert et al., 2011) parameter quantifying the effective size of homologous sequence cluster.As evidenced in Figure 4B and Appendix 2-table 8, ProtTrans consistently obtains competitive or superior performance compared to the MSA profile.Notably, for the target proteins with few homologous sequences (Neff <2), ProtTrans surpasses MSA profile significantly with an improvement of 3.9% on AUC (p-value = 4.3 × 10 -8 ).On the other hand, removing the structure information (implemented by a transformer model solely input with ProtTrans sequence features) obtains the worst performance with an average AUPR of 0.484 (Figure 4A).This observation indicates that the knowledge of protein structure may be more critical than sequence information in binding site prediction tasks.Additionally, the removal of the geometric featurizer within GPSite also causes a substantial decline in performance (average AUPR from 0.594 to 0.533), attesting to the significance of GPSite's perception of protein geometry.We also assessed the benefit of training with predicted instead of native structures, which brings an average AUPR increase of 4.2% as detailed in The effect of training with predicted structures.
We next elucidate the benefits of the multi-task framework in GPSite by comparing it with a baseline approach in which a model is trained and evaluated for each dataset separately.As depicted in Figure 4C and Appendix 2-table 7, GPSite consistently outperforms the single-task baseline, especially for the ligands with limited training data.For instance, directly fitting a model on the HEM training set with only 176 proteins reaches an AUPR of 0.716 for the test set.Alternatively, combining datasets of diverse ligands in a multi-task framework brings an AUPR increase of 0.086 for HEM.This suggests that multi-task learning can compensate for the scarcity of training data by leveraging a shared network trained on a larger dataset encompassing different types of ligand-binding proteins that potentially share similar binding patterns.We also conducted cross-type evaluations to investigate the specificity of the ligand-specific MLPs and the inherent similarities among different ligands in The cross-type performance of the multi-task network in GPSite.
Residues that are conserved during evolution, exposed to solvent, or inside a pocket-shaped domain are inclined to participate in ligand binding.During the preceding multi-task training process, the shared network in GPSite should have learned to capture such common binding mechanisms.Here we show how GPSite can be easily extended to the binding site prediction for other unseen ligands by adopting the pre-trained shared network as a feature extractor.We considered a carbohydrate-binding site dataset from Sun et al., 2022 which contains 100 proteins for training and 49 for testing.We first visualized the distributions of residues in this dataset using t-SNE (van der Maaten and Hinton, 2008), where the residues are encoded by raw feature vectors encompassing ProtTrans embeddings and DSSP structural properties, or latent embedding vectors from the shared network of GPSite trained on the 10 molecule types previously.As shown in Figure 4D, the binding and non-binding residues overlap and are indistinguishable when encoded by raw feature vectors.On the contrary, the latent representations from GPSite effectively improve the discriminability between the binding and non-binding residues.Employing these informative hidden embeddings as input features to train a simple MLP exhibits remarkable performance with an AUC of 0.881 (Figure 4E), higher than that of training a single-task version of GPSite from scratch (AUC of 0.853) or other state-of-the-art methods such as MTDsite (Sun et al., 2022) and SPRINT-CBH (Taherzadeh et al., 2016b).These results highlight the effectiveness of multi-task learning and the scalability of GPSite to unseen ligands.

Large-scale binding site annotation for Swiss-Prot
In light of the efficiency and effectiveness of GPSite, we sought to annotate and analyze the potential binding interfaces of various kinds for the entire Swiss-Prot database.For this task, we applied ESMFold to predict the structures of 568,326 sequences in Swiss-Prot, which required approximately 8.5 days as described in Methods.Typically, it takes 16 s to predict the structure of a protein with 500 residues, or 100 s for 1000 residues (Appendix 3-figure 6).The feature extraction and GPSite inference procedures overall cost about 5 hr.All ESMFold-predicted structures accompanied by the binding site annotations for Swiss-Prot are freely available in our user-friendly GPSiteDB database (https://bio-web1.nscc-gz.cn/database/GPSiteDB).Appendix 3-figure 7 further illustrates the distributions of the protein length and the predicted TM-score (pTM) estimated by ESMFold for the Swiss-Prot sequences, where most proteins are no longer than 500 residues and predicted with high confidence (pTM >0.8).For the subsequent downstream analyses, we only considered the predicted structures with pTM >0.7, resulting in a total of 370,140 structures.
Exploiting the residue-level binding site annotations, we could readily extend GPSite to discriminate between binding and non-binding proteins of various ligands.Specifically, a protein-level binding score indicating the overall binding propensity to a specific ligand can be generated by averaging the top k predicted scores among all residues.Empirically, we set k to 5 for metal ions and 10 for other ligands, considering the distributions of the numbers of binding residues per sequence observed in the training set.As depicted in Figure 5A, the GPSite binding scores for proteins with the corresponding ligand-binding molecular functions are significantly higher than those without such annotations in Swiss-Prot (p-value <10 -165 for all ligands according to Mann-Whitney U test; Mann and Whitney, 1947).The accuracy of the GPSite protein-level binding scores is further validated by the ROC curves in Figure 5B, where GPSite achieves satisfactory AUC values for all ligands except protein (AUC of 0.608).This may be ascribed to the fact that protein-protein interactions are ubiquitous in living organisms while the Swiss-Prot function annotations are incomplete (see GPSite is effective for completing the function annotations in Swiss-Prot).Moreover, we attempted to gather the top 20,000 proteins with the highest GPSite binding scores for each ligand to expand the binding function annotations in Swiss-Prot.We could immediately notice that the GPSite-predicted binding proteins are involved in biological processes consistent with existing knowledge as shown in Figure 5C and Appendix 3-figure 8.For instance, the DNA-binding proteins predicted by GPSite are prone to participate in DNA repair, DNA-templated transcription, DNA recombination and replication, while the RNA-binding proteins are inclined to perform translation and RNA modification.
Capitalizing on the predicted structures and annotations within GPSiteDB, cell biologists are empowered to easily locate the genetic variants and assess their potential disruptions in proteinligand interactions and pathogenicity.This facilitates the establishment of rational working hypotheses to propel therapeutic development in a more informed manner.Here we conduct analyses on the associations between binding sites and genetic variants for the human proteome as an example.Notably, 20.67% of the pathogenic variant sites on the surfaces of the predicted structures fall in the GPSite-predicted interfaces, higher than the benign variants (11.97%) or the random baseline (16.35%) as described in Figure 5D.Consistent trend is observed in Appendix 3-figure 9 when considering variants in the entire structure (rather than solely on surface).Besides, we investigated the pathogenic probabilities of variants in different locations in Figure 5E.As expected, the pathogenicity of variants located in the predicted binding sites is higher than those in the non-binding sites.Interestingly, our analysis uncovered that the pathogenic probabilities of variants in the predicted binding sites of ATP and metal ions surpass those of other ligands.One possible reason is that the binding interfaces of ATP and metal ions typically comprise small pockets formed by a limited number of residues.Consequently, variants affecting even a single residue within such pockets may exert a substantial influence on the overall pocket functionality and lead to diseases.

Discussion
In this study, we present GPSite to accurately and efficiently predict protein binding sites of diverse biologically relevant molecules including DNA, RNA, peptide, protein, ATP, HEM, and metal ions.By leveraging the informative sequence embeddings and predicted structures from pre-trained language models, GPSite is liberated from the reliance on MSA or experimental protein structures.To encapsulate the high-level bio-physicochemical characteristics in the predicted structures, GPSite incorporates a comprehensive geometric featurizer and an edge-enhanced graph neural network to refine both residual and relational geometric contexts in an end-to-end manner.GPSite also stands out from the previous approaches by applying a multi-task framework to effectively model the intrinsic relationships among different binding partners.Results across various benchmark datasets indicate that GPSite substantially outperforms state-of-the-art sequence-based and structure-based methods, even under conditions where the predicted structures are of lower quality.Finally, we demonstrate GPSite's scalability to genome-scale sequence databases by annotating binding sites for over 568,000 sequences in Swiss-Prot within 9 days.Further analyses suggest that these annotations are not only accordant with existing knowledge but also capable of facilitating discoveries of unexplored biology in protein function and genetic variant.
Despite the noteworthy advancements achieved by GPSite, there remains scope for further improvements.GPSite may be improved by pre-training (Zhang et al., 2022b) on the abundant predicted structures in ESM Metagenomic Atlas (Lin et al., 2023), and then fine-tuning on binding site datasets.Besides, the hidden embeddings from ESMFold may also serve as informative protein representations.Additional opportunities for upgrade exist within the network architecture.For example, a variational Expectation-Maximization framework (Zhao et al., 2022) can be adopted to handle the hierarchical atom-to-residue graph structure inherent in proteins.Meta-learning (Finn et al., 2017) could also be explored in this multi-task scenario, which allows fast adaptation to unseen tasks with limited labels.
As the gap between unannotated and annotated sequences is expanding at an unparalleled rate, GPSite serves as a reliable, efficient, versatile and user-friendly tool for unraveling the extensive and dynamic landscape of protein-ligand interactions.By harnessing the capabilities of GPSite, researchers can readily uncover fresh biological functions of proteins, gain valuable insights into the underlying pathogenic mechanisms of gene mutations, or design novel drugs targeting specific binding pockets.

Benchmark datasets
The benchmark datasets for evaluating binding site predictions of DNA, RNA, peptide, ATP, and HEM are constructed from BioLiP (Zhang et al., 2024), a database of biologically relevant protein-ligand complexes primarily from PDB.For each ligand, we collected the corresponding binding proteins with resolutions of ≤3.0 Å and lengths of 50-1500 from BioLiP released on 29 March 2023.A binding residue is defined if the smallest atomic distance between the target residue and the ligand is <0.5 Å plus the sum of the Van der Waal's radius of the two nearest atoms.We combined the binding site annotations of identical sequences and then removed redundant proteins sharing sequence identity >25% over 30% alignment coverage using CD-HIT (Fu et al., 2012).Finally, each benchmark dataset was split into a training set with proteins released before 1 January 2021, as well as an independent test set with proteins released from 1 January 2021 to 29 March 2023.Besides, the benchmark dataset of protein-protein binding sites is directly from Yuan et al., 2021, which contains non-redundant transient heterodimeric protein complexes dated up to May 2021.Surface regions that become solvent inaccessible on complex formation are defined as the ground truth protein-binding sites.The benchmark datasets of metal ion (Zn 2+ , Ca 2+ , Mg 2+ , and Mn 2+ ) binding sites are directly from Yuan et al., 2022c, which contain non-redundant proteins dated up to December 2021 from BioLiP.Combining all these 10 datasets results in a total of 8441 training sequences and 1838 test sequences.Details of the statistics of these benchmark datasets are given in Appendix 2-table 1.

Structure prediction and preprocessing
We harnessed ESMFold, a fast and accurate end-to-end model to predict protein structures from sequences.ESMFold is based on a language model with 3B parameters pre-trained over sequences in UniRef50 (Suzek et al., 2007), and a folding head similar to AlphaFold2 trained on experimental structures from PDB and predicted structures from AlphaFold2.The structure prediction for our whole benchmark datasets (~10,200 sequences, ~300 amino acids on average) cost only ~28 hr on an NVIDIA A100 GPU.For each residue in the predicted structures, we gathered the coordinates of the N, C α , C and O atoms as well as the centroid of the heavy sidechain atoms (denoted as R).In this way, the structure of a protein can be represented by a coordinate matrix X ∈ R n×5×3 , with n denoting the number of residues.

Protein features
GPSite leverages the pre-trained protein language model ProtTrans (version: ProtT5-XL-U50) to generate sequence features efficiently, thus bypassing slow sequence alignments.ProtTrans is a transformer-based auto-encoder named T5 (Raffel et al., 2020) pre-trained with the BERT's denoising objective (Kenton and Toutanova, 2019), essentially learning to predict the masked amino acids.Concretely, ProtTrans contains 3B parameters, which was first trained on BFD (Steinegger et al., 2019) and then fine-tuned on UniRef50.We extracted the output from the last ProtTrans encoder layer as sequence representations, containing a 1024-dimensional vector for each residue.The inference cost of ProtTrans is extremely low, and the embedding extraction process for our whole benchmark datasets can be done within 5 min on an NVIDIA A100 GPU.The feature values in the sequence embeddings are further normalized to scores between 0 and 1 as follows: where v is the original feature value, and v min and vmax are the minimum and maximum values of this feature type observed in the training set, respectively.In addition, we also calculated two structural properties from the predicted structures using DSSP: (i) Relative solvent accessibility (RSA), which is the normalized solvent accessible surface area (ASA) by the maximum ASA of the corresponding amino acid type.(ii) One-hot secondary structure profile representing one of the eight secondary structure states.

The architecture of GPSite
The overall architecture of GPSite is shown in Figure 1.First, the protein sequence is input to the pre-trained language model ProtTrans and the folding model ESMFold to generate the sequence embedding and predicted structure, respectively.Second, a protein radius graph is constructed from the structure, where residues constitute the nodes and adjacent nodes (distance between C α <15 Å) are connected by edges.In addition to the pre-computed residue features (ProtTrans embedding and structural properties by DSSP), a comprehensive, end-to-end geometric featurizer is employed to extract the geometric node features including distance, direction and angle, as well as geometric edge features between residues including distance, direction and orientation.Third, the resulting geometric-aware attributed graph is input to a shared GNN with message passing, edge update and global node update, to capture the common binding-relevant characteristics among different molecules.Finally, 10 ligand-specific MLPs are adopted to learn the binding patterns of particular molecules in a multi-task manner.

The geometric featurizer
GPSite represents the protein as a radius graph derived from the C α coordinates of the residues, where the radius is equal to 15 Å.An end-to-end featurizer is utilized to act directly on the atomic coordinate matrix X for geometric feature extraction similar to Gao et al., 2022, except that we additionally encode the sidechain conformations of the residues.In this representation, a local coordinate system is first defined at each residue based on the relative position of the C α atom to other backbone atoms.Then, several geometric node and edge features are derived to capture the arrangements of backbone and sidechain atoms in or between residues.(i) Local coordinate system.We define a local coordinate system for residue i , where b i is the negative bisector of the angle formed by the N, C α , and C atoms, and n i is a unit vector normal to this plane.Formally, we have: Based on the local coordinate systems, we could construct geometric features that are invariant to rotation and translation for single or pair of residues.
(ii) Geometric node features.GPSite constructs distance, direction and angle features for each residue.Given the coordinates of two atoms A and B , the distance feature is computed via is a radial basis function.For the intra-residue distance features of node Here, R denotes the centroid of the heavy sidechain atoms.The direction features encoding relative directions of other inner atoms to C α in residue i are computed via As shown in Figure 1, we also incorporate the sine and cosine values of the bond angles to consider the backbone geometry.
(iii) Geometric edge features.Similarly, we construct geometric features between neighboring residues including distance, direction and orientation.The inter-residue distance features RBF ( ∥A − B∥ ) between nodes i and j are computed with atoms To reflect the relative spatial rotation between the two reference frames of residues i and j , the orientation feature q is the quaternion encoding function representing 3D rotation matrices as four-element vectors (Huynh, 2009).

The edge-enhanced graph neural network
The above-mentioned attributed graph with features from ProtTrans, DSSP and the geometric featurizer is input to several GNN layers with message passing, edge update and global node update modules, to learn the residue representations by considering multi-scale interactions in node, edge, and global context levels.
(i) Message passing with graph transformer.Since transformer is well-acknowledged as the most powerful network in modeling sequence and graph data (Yuan et al., 2022c;Ingraham et al., 2019;Zheng et al., 2020), we adopt its multi-head attention mechanism while taking the edge features into account for message passing in graphs.Formally, we denote the hidden feature vectors of node i and edge j → i in layer l as h l i and e l ji , respectively.Before the first GNN layer, we apply an MLP to project the initial node and edge features to the d -dimensional space.To update node i , the message passing in layer l is performed as follows: where the attention coefficient α l ji from node j to i is calculated by: The learnable weight matrices W l Q , W l K and W l V are used to project the node feature vectors into the corresponding query, key and value representations.W l E is used to transform the edge features which will be subsequently added to the key and value representations.N ( i ) denotes the neighbors of node i .In practice, we use multi-head attention to linearly project the queries, keys and values multiple times, perform the attention function in parallel and finally concatenate them together.
(ii) Edge update.To improve the model's capability, we update the features of an edge using its connecting nodes: where || denotes the vector concatenation and EdgeMLP is an MLP for edge update.
(iii) Node update with global context attention.While the local node and edge interactions play crucial roles in learning residue representations, the global information is also valuable for improving accuracy.However, global self-attention across the whole protein is computationally intensive.Alternatively, here we learn a global context vector for each protein and use it to apply gated attention for the node representations similar to Gao et al., 2022: where n is the number of residues in a protein, GateMLP is an MLP for gated attention, σ ( • ) is the sigmoid function, and ⊙ denotes the element-wise product operation.

Multi-task learning
To better capture the intrinsic similarities of binding patterns among different ligands and enable efficient predictions in a concurrent fashion, GPSite employs a multi-task framework, where the shared edge-enhanced GNN is used to model the common binding-relevant characteristics, followed by 10 ligand-specific MLPs to mine the binding patterns of particular molecules.In the training steps, different types of ligand-binding proteins are input to the same network, and predictions for the 10 types of ligands are yielded.Nonetheless, only predictions with the corresponding known ligandbinding sites are used to calculate loss and perform backpropagation, while the predictions of other ligands without ground truth data are masked.That is, each protein is used to train the shared GNN and the corresponding ligand-specific MLP(s) of its known binding partner(s) without affecting other irrelevant MLP(s).

Implementation and evaluation
We performed five-fold cross-validation on the training data, where the 10 training sets were mixed and split into five folds randomly, and then each time a model was trained on four folds and evaluated on the remaining fold.This process was repeated for five times and the average validation performance was used to optimize the hyperparameters of the network.In the test phase, all five trained models from cross-validation were used to make predictions, which were averaged as the final prediction of GPSite.Specifically, we adopted Pytorch 1.13.1 (Paszke et al., 2019) to implement GPSite, which contains a four-layer shared GNN with 128 hidden units and four attention heads.The Adam optimizer (Kingma and Ba, 2014) with the one-cycle learning rate policy (Smith and Topin, 2019) was used for model optimization on the binary cross entropy loss.Within each epoch, we randomly drew 25,000 samples from the training data with replacement to train our model using a batch size of 16.The training process lasted at most 25 epochs and we performed early stopping based on the validation performance, which took ~1.5 hr on an NVIDIA A100 GPU.Similar to the previous works, we use recall, precision, accuracy, F1-score, Matthews correlation coefficient (MCC), AUC, and AUPR to evaluate the prediction performance, whose detailed definitions are given in Evaluation metrics.AUC and AUPR are independent of thresholds, thus reflecting the overall performance of a model.The other metrics are calculated by converting the predicted binding probabilities to binary predictions with a threshold for each ligand, which is determined by maximizing MCC on the validation sets.We adopted AUPR for hyperparameter selections as it is more sensitive and informative than AUC in imbalanced two-class classification tasks (Saito and Rehmsmeier, 2015).

High-throughput annotation and analysis on Swiss-Prot
We downloaded all the available 569,516 sequences in Swiss-Prot (release: 2023-05-03) and then removed sequences longer than 2700 residues (0.21%) due to the memory limit of GPUs, resulting in a total of 568,326 sequences.Non-standard amino acids in these sequences were also removed.ESMFold was applied to predict the structures from sequences on 16 NVIDIA A100 (80 GB) GPUs, which cost ~8.5 days.The structure preprocessing and feature extraction (by ProtTrans and DSSP) procedures overall cost ~4 hr on the same GPU cluster, and the inference of binding sites using GPSite took ~1 hr.
For the downstream analyses of protein function and variant, we only considered the predicted protein structures with length ≥50 and pTM >0.7 evaluated by ESMFold, consisting of 370,140 structures eventually.Interface residues are defined as residues with predicted binding probabilities higher than the pre-defined thresholds described in implementation and evaluation.Surface residues on the predicted structures are defined as the residues with RSA >5% (Jones and Thornton, 1997) computed by DSSP.The annotated GO terms of molecular function and biological process for all sequences were downloaded from UniProt (UniProt Consortium, 2023), and we up-propagated the annotations using all types of relationships defined in the hierarchical structure of GO (release: 2023-05-10).The binding proteins of a specific ligand were determined as those annotated with the corresponding ligand-binding molecular function, and the non-binding proteins were randomly sampled to the same number as binding proteins.Concretely, we collected 21680, 42074, 1240, 24108, 74428, 4960, 15030, 2088, 24161, and 4093 binding proteins from Swiss-Prot for DNA, RNA, peptide, protein, ATP, HEM, Zn 2+ , Ca 2+ , Mg 2+ , and Mn 2+ respectively.The pathogenicity annotations of human protein altering variants were downloaded from UniProt (release: 2023_02), which contain UniProt manually reviewed natural variants, as well as variants imported from other public resources such as Ensembl Variation (Martin et al., 2023) and ClinVar (Landrum et al., 2018), and the conflicting annotations were removed.
GeoBind (Li and Liu, 2023) GeoBind is a structure-based nucleic-acid-and ligand-binding site predictor, which employs geodesic convolution to point cloud on the protein surface.Its input features contain HMM, atom type, and local curvature of the surface.We used its webserver (http://www.zpliulab.cn/GeoBind/)for evaluation.
GraphSite (Yuan et al., 2022b) GraphSite is a sequence-based DNA-binding site predictor, which adopts a graph transformer for massage passing on protein residue graphs constructed from AlphaFold2-predicted structures.Its input features contain sequence representations from AlphaFold2, PSSM, HMM, and structural properties (RSA, SS and torsion angles) from DSSP.We used its standalone program with pre-trained model weights from https://github.com/biomed-AI/GraphSite(Yuan, 2023a), for inference and evaluation.
aaRNA (Li et al., 2014) aaRNA is a structure-based RNA-binding site predictor, which employs a fully connected neural network input with sequence features based on PSSM and HMM, as well as structural features like RSA, SS, and curvature of the protein surface.We used its webserver (https://sysimm.ifrec.osaka-u.ac.jp/aarna/) for evaluation.
PepBind (Zhao et al., 2018) PepBind is a sequence-based peptide-binding site predictor, which combines the predictions from a SVM-based ab-initio method SVMpep and two template-based methods S-SITE and TM-SITE (Yang et al., 2013).SVMnuc was trained with sequence features from PSSM, HMM, predicted SS and predicted intrinsic disorder.The structures required by TM-SITE are predicted by the I-TASSER Suite (Yang et al., 2015).We used its webserver (http://yanglab.nankai.edu.cn/PepBind/) for evaluation.
TargetS (Yu et al., 2013) TargetS is a sequence-based ligand-binding site predictor, which extracts evolutionary information (PSSM), predicted SS and ligand-specific binding propensity from sequence context using a slidingwindow strategy.It then employs several SVMs to learn the local binding patterns, which are assembled by the modified AdaBoost algorithm.We used its webserver (http://www.csbio.sjtu.edu.cn/TargetS/) for evaluation.
DELIA (Xia et al., 2020) DELIA is a structure-based ligand-binding site predictor, which uses the bidirectional long short-term memory (BiLSTM) networks to refine sequence features including binding propensity from S-SITE, PSSM, HMM, predicted SS and predicted RSA, as well as a CNN to extract characteristics from the protein distance matrices.We used its webserver (http://www.csbio.sjtu.edu.cn/bioinf/delia/) for evaluation.
MIB (Lin et al., 2016) MIB is a template-based metal ion-binding site predictor, where the fragment transformation method is used for structural comparison between query proteins and templates without any data training.The predictions of MIB for the test proteins are directly obtained from our previous work (Yuan et al., 2022c), which were originally generated from the MIB webserver (http://bioinfo.cmu.edu.tw/MIB/).
IonCom (Hu et al., 2016) IonCom is a structure-based metal and acid radical ion-binding site predictor, which combines the predictions from an SVM-based ab-initio method IonSeq and four template-based methods including COFACTOR (Roy et al., 2012), TM-SITE, S-SITE, and COACH (Yang et al., 2013).IonSeq was trained with sequence features from PSSM, ligand-specific binding propensity, predicted SS, predicted RSA, etc.We used its standalone program with pre-trained weights from https:// zhanggroup.org/IonCom/ for inference and evaluation.
LMetalSite (Yuan et al., 2022c) LMetalSite is a sequence-based alignment-free metal ion-binding site predictor where the pre-trained protein language model ProtTrans is used to extract sequence embeddings and a transformer with multi-task learning is applied to capture the intrinsic similarities between different metal ions.The prediction results of the test proteins can be obtained from our webserver (http://bio-web1.nsccgz.cn/app/lmetalsite)or the standalone program (https://github.com/biomed-AI/LMetalSite,Yuan, 2022a).

Performance comparison between GPSite and PeSTo
Since 340 out of 375 proteins in our protein-protein binding site test set share >30% identity with the training sequences of PeSTo, we performed a separate comparison between GPSite and PeSTo using the training and test datasets from PeSTo.By re-training with simply the same hyperparameters, GPSite achieves better performance than PeSTo (AUPR of 0.824 against 0.797) as shown in Appendix 2-table 4. Furthermore, when using ESMFold-predicted structures as input, the performance of PeSTo decreases substantially (AUPR of 0.691), and the superiority of our method will be further reflected.As in Krapp et al., 2023b, the performance of ScanNet is also included (AUPR of 0.720), which is also largely outperformed by GPSite.

Case study for the ribosome biogenesis protein ERB1
Here we present an example of an RNA-binding protein, i.e., the ribosome biogenesis protein ERB1 (PDB: 7R6Q, chain m), to illustrate the impact of predicted structure's quality.As shown in Appendix 3-figure 5, ERB1 is an integral component of a large multimer structure comprising protein and RNA chains (i.e. the state E2 nucleolar 60S ribosome biogenesis intermediate).Likely due to the neglect of interactions from other protein chains, ESMFold fails to predict the correct conformation of the ERB1 chain (TM-score=0.24).Using this incorrect predicted structure, GPSite achieves an AUPR of 0.580, lower than GraphBind input with the native structure (AUPR=0.636).However, the performance of GraphBind substantially declines to an AUPR of 0.468 when employing the predicted structure as input.Moreover, if GPSite adopts the native structure for prediction, a notable performance boost can be obtained (AUPR=0.681).

Generation of the evolutionary features from MSA
Evolutionarily conserved residues may contain motifs related to important protein properties.Here, we also evaluated the widely used evolutionary features from MSA in our ablation studies, including position-specific scoring matrix (PSSM) and hidden Markov models (HMM) profile.PSSM is produced by running PSI-BLAST (Altschul et al., 1997) to search the query sequence against UniRef90 (Suzek et al., 2007) with three iterations and an E-value of 0.001.HMM profile is generated by running HHblits (Remmert et al., 2011) against UniClust30 (Mirdita et al., 2017) with default parameters.Each residue is encoded into a 20-dimensional vector in PSSM or HMM.The feature values in the sequence representations from PSSM and HMM are further normalized to scores between 0 and 1 as follows: where v is the original feature value, and v min and vmax are the minimum and maximum values of this feature type observed in the training set, respectively.

The effect of training with predicted structures
We examined the performance under different training and evaluation settings as shown in Appendix 2-table 9.As expected, the model yields exceptional performance (average AUPR of 0.656) when trained and evaluated using native structures.However, if this model is fed with predicted structures of the test proteins, the performance substantially declines to an average AUPR of 0.573.This trend aligns with the observations for other structure-based methods as illustrated in The cross-type performance of the multi-task network in GPSite We conducted cross-type evaluations by applying different ligand-specific MLPs in GPSite for the test sets of different ligands.As shown in Appendix 2-table 10, for each ligand-binding site test set, the corresponding ligand-specific network consistently achieves the best performance.This indicates that the ligand-specific MLPs have specifically learned the binding patterns of particular molecules.We also noticed that the cross-type performance is reasonable for the ligands sharing similar properties.For instance, the DNA-specific MLP exhibits a reasonable AUPR when predicting RNA-binding sites, and vice versa.Similar trends are also observed between peptide and protein, as well as among metal ions as expected.Interestingly, the cross-type performance between ATP and HEM is also acceptable, potentially attributed to their comparable molecular weights (507.2 and 616.5, respectively).

GPSite is effective for completing the function annotations in Swiss-Prot
As depicted in Figure 5A, GPSite assigns relatively high prediction scores to the proteins without 'protein binding' function in the Swiss-Prot annotations, leading to a modest AUC value of 0.608 (Figure 5B).This may be ascribed to the fact that protein-protein interactions are ubiquitous in living organisms while the Swiss-Prot function annotations are incomplete.To support this hypothesis, we present two proteins as case studies, both sharing <20% sequence identity with the proteinbinding training set of GPSite.The first case is Aminodeoxychorismate synthase component 2 from Escherichia coli (UniProt ID: P00903).GPSite confidently predicted this protein as a protein-binding protein with a high prediction score of 0.936.Notably, this protein was not annotated with the

Figure 1 .
Figure1.The overview of GPSite.The protein sequence is input to the pre-trained language model ProtTrans and the folding model ESMFold to generate the sequence embedding and predicted structure, respectively.According to the structure, a protein radius graph is constructed where residues constitute the nodes and adjacent nodes are connected by edges.In addition to the pre-computed residue features of ProtTrans embedding and DSSP structural properties, a comprehensive, end-to-end geometric featurizer is employed to extract the geometric node features including distance, direction and angle, as well as geometric edge features between residues including distance, direction and orientation.Here, the R group denotes the centroid of the heavy sidechain atoms.The resulting geometric-aware attributed graph is input to a shared GNN to perform edgeenhanced message passing for capturing the common binding-relevant characteristics among different molecules.Finally, 10 ligand-specific MLPs are adopted to learn the binding patterns of particular molecules in a multi-task manner.Examples of the applications of GPSite include binding site identification and protein-level Gene Ontology (GO;Ashburner et al., 2000) function prediction.

Figure 2 .
Figure 2. The performance of GPSite and the state-of-the-art methods.(A) The ROC and precision-recall curves of GPSite on the 10 binding site test sets.The numbers in the legends are areas under the curves.(B-C) The AUPR values of the top-performing methods in each test set.The methods marked with * denote evaluations using the ESMFold-predicted structures as input.

Figure 3 .
Figure 3.The performance of GPSite on low-quality predicted structures.(A) The performance of GPSite on structures of different qualities, and the comparisons with the best experimental structure-based methods in the test sets of DNA, RNA, and peptide.The experimental structure-based methods input with ESMFold-predicted structures are marked with *. (B) Distributions of the TM-scores between native and predicted structures in the DNA, RNA and peptide datasets.(C) The correlations between the prediction quality of ESMFold and the performance of GPSite and GraphBind on the RNA-binding site test set when TM-score <0.5.The scatters denote the average TM-score and AUPR for each bin after sorting the proteins according to the TM-scores and evenly dividing them into 20 discrete bins.The lines are fit to the original data (without binning) using linear regression.(D) The glucocorticoid receptor (GR) in complex with DNA, a coactivator peptide, and Zn 2+ ions (PDB: 7PRW).The ESMFold-predicted protein structure (gray) is superimposed to the native structure (cyan) using US-align (TM-score=0.72).The ligands are colored in orange.(E) Superposition of the native (cyan) and predicted (gray) DNA-binding domains of GR (TM-score=0.96).(F-H) The Zn 2+ , DNA and peptide binding site predictions by GPSite for the predicted GR structure in cartoon or surface view.True positives, false positives and false negatives are colored in green, red and yellow, respectively.The ligands in orange were subsequently added based on the native complex structure to show the quality of the predictions by GPSite.

Figure 4 .
Figure 4.The effects of protein features and model designs.(A) Ablation studies on sequence and structure information in the DNA, RNA, and peptide test sets.The average performance of the 10 test sets is also shown.(B) Performance comparison between GPSite and the baseline model using MSA profile for proteins with different Neff values in the combined test set of the 10 ligands.(C) Performance boosts in AUPR using GPSite compared to the single-task baseline.(D) Visualization of the distributions of residues encoded by raw feature vectors (left) or hidden embedding vectors from the pre-trained shared network in GPSite (right) for the unseen carbohydrate-binding site dataset using t-SNE.The binding and non-binding residues are colored in red and gray, respectively.(E) The performance when using the hidden embeddings from GPSite as input features to train an MLP for carbohydrate-binding site prediction, and its comparisons with other methods.

Figure 5 .
Figure 5. Analyses of Swiss-Prot based on the binding site annotations by GPSite.(A) The distributions of the binding scores assigned by GPSite for proteins with or without certain ligand-binding molecular function in GO. (B)The ROC curves when using the GPSite binding scores to distinguish between binding and non-binding proteins of various ligands.(C) The percentage of proteins predicted as binding to DNA and RNA by GPSite to be annotated with certain biological process in Swiss-Prot.Only the specific biological process terms with depth ≥8 in the GO directed acyclic graph are considered, among which the top 15 terms with the highest percentages are displayed.(D) The percentage of surface pathogenic or benign natural variant sites within GPSite-predicted interfaces.The baseline is the probability of a random surface residue being annotated as an interface residue.(E) The pathogenic probabilities of variants located in non-binding sites or different types of binding sites predicted by GPSite.

Figure 2 .
More importantly, in the practical scenario where only predicted structures are available for the target proteins, training the model with predicted structures (i.e.GPSite) results in superior performance than training the model with native structures (average AUPR of 0.594 against 0.573), probably owing to the consistency between the training and testing data.For completeness, the results in Appendix 3-figure2are also included where GPSite is tested with native structures (average AUPR of 0.637).
Note: 'Pep' and 'Pro' denote peptide and protein, respectively.The numbers in this table are AUPR values.The best/second-best result in each test set is indicated by bold/underlined font.