SmoPSI: Analysis and Prediction of Small Molecule Binding Sites Based on Protein Sequence Information

The analysis and prediction of small molecule binding sites is very important for drug discovery and drug design. The traditional experimental methods for detecting small molecule binding sites are usually expensive and time consuming, and the tools for single species small molecule research are equally inefficient. In recent years, some algorithms for predicting binding sites of protein-small molecules have been developed based on the geometric and sequence characteristics of proteins. In this paper, we have proposed SmoPSI, a classification model based on the XGBoost algorithm for predicting the binding sites of small molecules, using protein sequence information. The model achieved better results with an AUC of 0.918 and an ACC of 0.913. The experimental results demonstrate that our method achieves high performances and outperforms many existing predictors. In addition, we also analyzed the binding residues and nonbinding residues and finally found the PSSM; hydrophilicity, hydrophobicity, charge, and hydrogen bonding have obviously different effects on the binding-site predictions.


Introduction
Proteins perform the biological functions through interactions with other molecules. In most cellular processes, proteins interact with small molecules to perform their biological functions. erefore, the prediction of proteinsmall molecule binding sites is of great significance for understanding and exploring the function of proteins [1][2][3][4][5][6]. With the continuous development of molecular pathology, people have a deeper understanding of the molecular mechanism of disease occurrence and drug efficacy. e study of the interaction between drug molecules and protein targets is the basis of the drug design process. In recent years, many calculation methods have been proposed for the problem of drug molecule and protein binding sites. ese methods are fast and inexpensive compared to traditional biochemical experiments. e identification methods for binding sites were mainly classified into the following categories. e purely geometric-based approach follows the assumption that the protein-small molecule binding site is usually located in the gap of the protein surface or the pores of the protein. For example, the CASTp algorithm can locate and measure the 3-dimensional structure of protein pockets [7]. Fpocket is a protein surface pocket and space detection package based on Voronoi tessellation and alpha spheres. It enables fast and efficient protein pocket detection and efficient pocket descriptor extraction [8]. PocketPicker is a grid-based automation technology for predicting protein binding pockets and specifying the shape and burial of potential binding sites [9]. e SITEHOUND algorithm identifies potential ligand-binding sites and regions characterized by favorable nonbonded interactions with a chemical probe [10]. Based on sequence data methods, protein sequences can be used to identify binding sites for ligand protein molecules. ATPsite combines secondary structure, solvent accessibility, and dihedral angles based on evolutionary information to construct the SVM classifier to predict the ATP binding residue [11]. WideDTA is a deep learning-based prediction model that employs protein sequence to drug-target binding affinity [12]. TargetS extracts three characteristics of evolutionary information, secondary structure, and ligand-specific binding propensity features. Based on these features, an Adaboost classifier scheme is proposed [13]. Yu et al. also proposed a classification scheme based on the PSSM and secondary structure of protein sequences [14][15][16]. In order to predict binding residues, TargetATPsite proposes a novel image sparse representation technique for encoding PSSMs of sequence data [17]. In addition, Q-SiteFinder utilizes the interaction energy between the protein and the van der Waals probe of the ligand to locate favorable binding sites [18]. Dai proposed a solution that not only uses PSSM features based on sequence data but also combines methods based on geometric cavity recognition.
Inspired by the previous work, in our study, we present a new SmoPSI method for predicting the binding sites of small molecules to protein targets based on protein sequence. e SmoPSI model was obtained by training the binding residue and nonbinding residue using the XGboost classifier which is an integrated learning algorithm based on gradient boosting. e SmoPSI method can predict binding domain residues based on the physicochemical properties and evolutionary information of extracting protein sequence residues. e PSSM of protein sequences has been used to predict binding sites, because whether the residues in the sequence are mutated is determined by many factors in the evolution process, and these factors also affect the binding of proteins to other protein-small molecules [19,20]. e physicochemical properties of the binding domain residues and the nonbinding domain residues are also different. For example, the binding domain of protein and DNA molecules exhibits a higher single negative charge characteristic [21][22][23][24][25]. erefore, we analyzed the hydrophobic, hydrophilic, charge, and hydrogen bond properties of the binding domain residues and nonbinding domain residues on the protein sequence. We also present the type information of the amino acid residues in the feature matrix in the form of a one-hot code and analyze the distribution of different residue types between the binding residues and the nonbinding residues. Based on the above features, we used the XGBoost algorithm for classification and prediction. e full name of XGBoost is eXtreme Gradient Boosting, which is an integrated learning algorithm based on gradient boosting [26]. Finally, we trained and verified 14 classes of protein-small molecules. e model SmoPSI achieved better results with an AUC of 0.918 and an ACC of 0.913 and obtained a more significant classification effect.

Datasets.
In the experiment, we obtained the dataset of protein-small molecules from the SC-PDB database (http:// bioinfopharma.ustrasbg.fr/scPDB/) [27]. As of August 2018, 16034 entries, 4782 proteins and 6326 ligands, have been released on the SC-PDB website. We removed redundant data using the PISCES program (http://dunbrack.fccc.edu/ Guoli/PISCES.php) after downloading all of the proteinsmall molecules. After removal of redundancy, 5090 proteinsmall molecules were screened with sequence identity less than 30% and a minimum chain length of 40 amino acid residues [28]. e realization of protein function needs to recognize the binding residue of the ligand molecule in the protein by binding to the ligand, which will provide important help for protein function research and drug development. Here, if the distance between the central atom of residues and any one of the ligand molecules is less than 3.5Å, this residue is a binding residue. Conversely, it is a nonbinding residue. Finally, we chose 14 types of proteinsmall molecules that are large enough to be useful for testing, as shown in Table 1. e number of protein sequences to each protein-small molecule and the number of bound residues and nonbinding residues in the sequence are shown, respectively. However, because of the large difference in the number of binding residues and nonbinding residues, there is a significant data imbalance problem in the dataset. So, in the course of the experiment, we used a downsampled method to repeatedly down sample the nonbinding residues during the training. We found through experiments the downsampled method can solve the problem of data imbalance.

PSSM.
e conservatism in the evolution of residues at the various positions of a protein sequence can be reflected by the evolutionary information of the protein. e evolutionary information of proteins is obtained by multisequence comparison of homologous sequences. Because the evolutionary information of proteins base on all members of the homologous family, it can fully respond to distant homologous relationships [29]. PSI-BLAST is a position-specific iterative basic local alignment search tool. We used it to scan the NCBI database with an e value of 0.001 and a number of iterations of 3, resulting in a L × 20 positionspecific scoring matrix (PSSM) of the protein sequence (L is the length of the sequence).

Physicochemical Properties.
Hydrophilicity characterizes the amount of protein residues that are compatible with water, and hydrophobicity indicates the ability to repel water. Hopp gave a hydrophobicity scale for each of the 20 amino acids with a fixed value (AAindex ID: HOPT810101). Patrick gave a hydrophilicity scale for each of the 20 amino acids with a fixed value (AAindex ID: HOPT810101). It is generally believed that amino acids having hydrophobicity tend to be oriented toward the interior of the protein structure, while amino acids having hydrophilicity tend to face the surface in the three-dimensional structure of the protein [30]. erefore, hydrophilic amino acids have a greater probability of interacting with small molecules. Such features help distinguish between the binding residues and the nonbinding residues. e electrostatic charge of the binding region of a protein to a small molecule is one of the most influential properties of protein-ligand interactions, which has been confirmed by many studies [31]. Electrostatic complementation contributes to the nonspecific binding of proteins to small molecules [32]. Among the 20 standard amino acids, Arg, His, and Lys are positively charged, and Asp and Glu are negatively charged. We believe that the electrostatic charge value of a protein residue is helpful in distinguishing between binding residues and nonbinding residues.
Hydrogen bonds are a kind of interaction force that is slightly stronger than the intermolecular force and slightly weaker than the covalent bond and the ionic bond. e hydrogen bond plays an important role in the interaction between the protein and the small molecules [33]. e interaction of a protein with small molecules may depend primarily on the hydrogen bonding ability of the atoms in the side chain of the amino acid. erefore, the number of hydrogen bonds contained in an amino acid is also one of its important physical and chemical properties. e hydrogen bonding properties of amino acids can be obtained from the AAindex database (AAindex ID: FAUJ880109) of the 20 amino acids, and 11 amino acids can form hydrogen bonds through their side chains.

Sliding Window Sampling.
For predicting the binding residues of proteins to small molecules, extracting information from sequences and constructing feature vectors is a very critical step. Since the data used in the experiment are sequence data, the binding residues and the nonbinding residues are continuously distributed, so we sought to create a sliding sampling window on the protein sequence. As shown in Figure 1, the sampling window contains the binding residues and (w − 1)/2 adjacent residues on both sides of the residue. We used eight different initial values as an alternative to the window length w [33]. If the residue in the center of the window is a binding residue, then we consider the matrix sampled by the window to be a positive set. A matrix sampled by a window is considered to be a negative set if the residue at the center of the window is a nonbinding residue. We believed that the binding residues are also affected by the environment surrounding the binding site residues.

Classification Model.
Tree boosting is an efficient and widely used machine learning method [26]. We use an extensible end-to-end tree boosting system XGBoost, which is widely used by scientists to achieve the most advanced results in many machine learning challenges. Gradient boosting is an improvement on the basis of boosting. e idea of the algorithm is to continuously reduce the residuals and further reduce the residual of the previous model in the gradient direction to obtain a new model. XGBoost is just an improved version of the Gradient Boosting algorithm. XGBoost generates a second-order Taylor expansion for the loss function and obtains the optimal solution for the regular term outside the loss function, making full use of the parallel computing advantages of the multicore CPU to improve the accuracy and speed. , ese metrics can be measured by the numbers of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) for each classifier, where TP is the number of proteins correctly predicted as binding sites, FP is the number of proteins incorrectly predicted as binding sites, TN is the number of proteins correctly predicted as no binding sites, and FN is the number of proteins incorrectly predicted as no binding sites. e area under the receiver operating characteristic (AUC) is an evaluation method for a predictor in a binary classification system.

Distribution of Amino Acid Classes.
We collected the molecular sequence of 6592 protein-small molecules and analyzed the type and distribution ratio of binding and nonbinding residues in all protein sequence data, as shown in Figure 2. For nonbinding residues, Ala, Leu, Gly, Val, and Glu are more distributed. In addition, we can also find that the distribution differences of Gly, Glu, Cys, and Lys are statistically significant for the binding residues and nonbinding residues. Differences in the distribution of these classes of amino acids may be related to the function of binding residues to nonbinding residues.

Analysis of Physicochemical Properties of Protein
Binding Regions. In addition to analyzing the distribution of each type of amino acid, we calculated the hydrophilicity, hydrophobicity, electrostatic charge, and hydrogen bond distribution of the protein binding domain and the nonbinding domain, as shown in Figures 3(a) and 3(b). e hydrophilicity and hydrophobicity are significantly different between the bound and unbound regions. e likely cause is because the binding domain residues are mostly distributed on the surface of the protein, the binding domain residues are more likely to contact with water molecules [30]. e hydrophilicity value and the hydrophobicity value represent the tendency to be hydrophilic and hydrophobic. As shown in Figure 3(c), we have found that the binding region exhibits a hydrophilic character and the nonbinding region exhibits a hydrophobic character. However, the electrostatic distribution of the nonbonded regions is still smaller than the bonded regions. en, we think that electrostatic complementation helps the binding of proteins to small molecules [34,35]. Although previous studies have found that hydrogen bond plays an important role in ligand and protein binding, we have no statistically significant difference in the distribution of hydrogen bonds between binding and unbinding regions [36,37]. Based on the aforementioned findings, hydrophilicity, hydrophobicity, and electrostatic charge of amino acids help us to construct a classification model for the binding residues and nonbinding residues.   Table 1, including three types of nucleotides (ATP, ADP, and GDP). e receptor for atrial natriuretic peptide (ANP) is a type-I transmembrane protein containing an extracellular ligand-binding domain, a single transmembrane sequence, an intracellular kinase-homologous domain, and a guanylate cyclase (GCase) domain [38]. ADP is the cognate ligand for the orphan G protein-coupled receptor [39].

Determination of the Length of the Sampling Window.
e width of the sampling window in data processing directly determines the quality of the dataset and the experimental effect of the classification. erefore, we tested the alternative 8 window lengths on the 14 molecular sets. We used the downsampling method for the phenomenon that the number of positive and negative datasets is different and then carried out 10-fold cross-validation. We selected two indicators AUC and ACC as the criteria for window width selection. e relationship between AUC and window width is shown in Figure 4(a). When the window is scaled up to 3, the AUC shows an upward trend. If w � 15, the AUC values of all types of datasets reach maximum values. en, the value of AUC begins to decrease. e docking on the protein surface is not only related to the residue itself but also related to the surrounding environment. When the window length is small, the sampled digital matrix does not effectively represent the surrounding features.
erefore, values of AUC and ACC are small. Due to the small size of small molecules, the number of bound residues in one sequence is not large. When the sampling length is too long, the sampled digital matrix will contain a lot of interference information, resulting in a small value of AUC and ACC too. e same result can also be observed in Figure 4(b). Because the result of ACC reaches a maximum at w � 15, we choose the window width w to be 15 residues.

Significance Testing of the Features.
In the previous discussion, we introduced and used five features of proteins: PSSM, hydrophilic, hydrophobic, electrostatic charge, and hydrogen bonding. To analyze the significance of different features, we used the mean decrease accuracy and random forest-based feature importance scoring system to measure the effect of each feature on the predictive model. e basic idea of the algorithm is to change a certain feature value to a random number and observe how much the accuracy of the model is reduced. For unimportant features, this method has little effect on the accuracy of the model, but it will greatly reduce the accuracy of the model for important features, as shown in Figure 5. e PSSM has the highest score for the features, followed by hydrophilicity, hydrophobicity, and charge. Hydrogen bonds have the least impact on the classification model. e above results are also consistent with the previous analysis of the protein binding region.

Performance of the Classification Model.
We constructed these feature vectors with amino acid classes, PSSMs, hydrophilicity, hydrophobicity, and charge, while the length of the window is set at the length of 15 residues. en, 14 kinds of protein-small molecules were classified and predicted. e results are shown in Table 2. We found that the AUC and ACC results for the ATP type of protein-small molecules are highest at 0.935 and 0.927, respectively. e average AUC and ACC of all 14 species of protein-small molecules are 0.918 and 0.913, respectively. e strategy of SmoPSI is useful to improve the performance of prediction.

Comparison with Existing Predictors.
Our method is compared with EC-RUS and TargetS, and results on the training sets are listed in Table 3. e recall values of our method are 0.934, 0.940, and 0.940 on ADP, ATP, and GDP, respectively. TargetS [13] achieves recall values of 0.561, 0.484, and 0.639 on ADP, ATP, and GDP, respectively. EC-RUS [40] achieves recall values of 0.561, 0.484, and 0.639 on ADP, ATP, and GDP, respectively. Obviously, the performance of the proposed method is better than that of TargetS and EC-RUS. In addition, our method achieves an average AUC value of 0.933, which is better than the average AUC values of TargetS and EC-RUS.

Conclusion
In previous studies of sequence-based binding site prediction, various methods did not combine PSSM features with the physicochemical properties of proteins. A different method was used in our experiments. First, the differences between the four types of physicochemical properties of hydrophilicity, hydrophobicity, electrostatic charge, and hydrogen bonding in the bound and unbound regions were analyzed. We have found that the residues in the binding region of the small molecule exhibit a hydrophilic character, and the charge characteristics of the binding region are also stronger than those of the nonbinding region. e distribution of hydrogen bonds in the two regions is never significantly greater than others, which can also be observed from the results of the feature significant analysis. At the same time, the distribution of different kinds of amino acids was also analyzed. We concluded that the distribution of five residues such as ALA and LEU was more than that of other   amino acids. e distribution of the four residues of GLY, GLU, CYS, and LYS in the binding region and the nonbinding region is significant. We constructed the feature vectors using the aforementioned features and the PSSM. e SmoPSI model was constructed using the XGBoost algorithm and tested on 15 distinct types of protein-small molecules. e average AUC value and ACC value obtained by the 10-fold cross-validation were 0.918 and 0.913, respectively. e SmoPSI method can help to understand the binding mechanism between proteins and small molecules and explore the characteristics of small molecules combined with proteins.

Data Availability
e data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
e authors declare that they have no commercial or associative conflicts of interest in connection with the work submitted.