Identifying Heat Shock Protein Families from Imbalanced Data by Using Combined Features

Heat shock proteins (HSPs) are ubiquitous in living organisms. HSPs are essential for cell growth and survival; their main function is to control the folding and unfolding of proteins. According to molecular function and mass, HSPs are categorized into six families: HSP20 (small HSPs), HSP40 (J-proteins), HSP60, HSP70, HSP90, and HSP100. In this paper, an improved method for HSP prediction is proposed: the split amino acid composition (SAAC), the dipeptide composition (DC), the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS) were selected to predict the HSPs with a support vector machine (SVM). To overcome the imbalanced data classification problem, the synthetic minority oversampling technique (SMOTE) was used to balance the dataset. The overall accuracy was 99.72% with a balanced dataset in the jackknife test by using the optimized combination feature SAAC+DC+CTF+PseACS, which was 4.81% higher than that obtained with the imbalanced dataset and the same combination feature. The Sn, Sp, Acc, and MCC of HSP families in our predictive model were higher than those in existing methods. This improved method may be helpful for protein function prediction.


Introduction
Heat shock proteins (HSPs) are ubiquitous in living organisms. They act as molecular chaperones by facilitating and maintaining proper protein structure and function [1][2][3][4]; in addition, they are involved in various cellular processes such as protein assembly, secretion, transportation, and degradation [5,6]. HSPs are rapidly expressed when cells are exposed to physiological and environmental conditions such as elevated temperature, infection, and inflammation [7,8]. Since their discovery in 1962 by Ritossa [9], HSPs have been widely studied, including their involvement in cardiovascular disease, diabetes, and cancer [10][11][12][13][14]. According to molecular function and mass, HSPs are categorized into six families: HSP20 (small HSPs), HSP40 (J-proteins), HSP60, HSP70, HSP90, and HSP100 [15]. These families have different functions. The HSP20 family consists of ATP-independent molecular chaperones that efficiently prevent irreversible aggregation by binding denatured proteins [16]. The HSP70 family is the most highly conserved among the HSP families; it is an ATP-dependent molecular chaperone that is involved in protein folding and remodeling [17]. HSP40 is the cochaperone of HSP70 and participates in DNA binding, protein degradation, intracellular signal transduction, exocytosis, endocytosis, viral infection, apoptosis, and heat shock sensing [18]. HSP90 is another ATP-dependent chaperone that controls protein function and activity by facilitating protein folding, binding of ligands to their receptors or targets, or the assembly of multiprotein complexes [19]. The function of the HSP100 proteins is to improve tolerance to temperature and to promote the proteolysis of specific cellular substrates and the regulation of transcription [20]. Experimental determination of HSPs is time-consuming and laborious, so an effective computational method for predicting HSPs is needed.
Recently, several computational methods for predicting HSPs have been proposed in the literature. Feng et al. developed a predictor called "iHSP-PseRAAAC" that used the reduced amino acid alphabet (RAAA) as a feature vector; the overall predictive accuracy was 87.42% with the jackknife test [21]. Ahmad et al. used the split amino acid composition (SAAC), the dipeptide composition (DC), and PseAAC [22,23] to identify HSPs; the highest overall predictive accuracy was 90.7% with the jackknife test [24]. Kumar et al. discriminated HSPs from non-HSPs, and the best prediction accuracy was 72.98% by using the dipeptide composition (DC) with a 5-fold cross-validation test [25]. Meher et al. used the G-spaced amino acid pair composition (GPC) to predict HSPs; a better result was obtained with the jackknife test [26]. Chen et al. summarized the recent advances in machine learning methods for predicting HSPs [27]. Feature selection is generally essential in classification, and an appropriate integrated feature model generally offers higher accuracy [28]. Hence, hybrid features have been successfully used in recent studies for constructing classifiers [29,30], and we used hybrid features to enhance performance. In this paper, the split amino acid composition (SAAC), the dipeptide composition (DC), the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS) were used to predict the HSPs with the same datasets as investigated by Feng et al. Data imbalance is a well-known problem in developing efficient and reliable prediction systems: with an imbalanced dataset, the classifier tends to favor the majority class. Here, the synthetic minority oversampling technique (SMOTE) was used to address the imbalance. The overall accuracy was 99.72% with a balanced dataset in the jackknife test by using the optimized combination feature SAAC+DC+CTF+PseACS, which was 4.81% higher than that obtained with the imbalanced dataset and the same combination feature.

Material and Methods
2.1. Dataset. The benchmark dataset was generated by Feng et al. [21]; it was originally taken from the HSPIR database. To reduce homologous bias and redundancy, the program CD-HIT [31] was used to remove sequences that have ≥40% pairwise sequence identity. In total, 2225 sequences were obtained from the different HSP families: the subset S1 contains 357 sequences, the subset S2 contains 1279 sequences, the subset S3 contains 163 sequences, the subset S4 contains 283 sequences, the subset S5 contains 58 sequences, and the subset S6 contains 85 sequences (see Table 1). The dataset can be freely downloaded from http://lin-group.cn/server/iHSP-PseRAAAC. Two independent datasets were also used: the HGNC dataset and the RICE dataset (see Table 2). The HGNC dataset [32] has 96 human HSPs, and the RICE dataset has 55 rice HSPs, of which 31 were obtained from Wang et al. [33] and 24, all from a single family, from Sarkar et al. [34]. The independent datasets can be freely downloaded from http://cabgrid.res.in:8080/ir-hsp.

The Prediction Model Construction
Overview. The prediction process is illustrated in Figure 1. Feature parameters were first extracted from the HSP sequences. After various feature parameters were tested, the results showed that better predictions may be obtained by combining the following four: the split amino acid composition (SAAC), the dipeptide composition (DC), the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS). In SAAC, the protein sequence was split into the N-terminus segment and the C-terminus segment according to the golden ratio. Among the four feature parameters, SAAC, DC, and CTF are based on the protein sequence, while PseACS is related to the protein secondary structure. Therefore, the feature parameters involve both sequence and structure information. The four feature parameters were combined, and the synthetic minority oversampling technique (SMOTE) was used to address the imbalance of the dataset. The overall accuracy (OA) was 99.72% with the balanced dataset, and the result demonstrates that the proposed method is superior to the existing methods.

Feature Extraction Techniques.
In order to predict the HSPs, it is very important to choose a classifier and a set of reasonable parameters. In this paper, the split amino acid composition (SAAC), the dipeptide composition (DC) [35], the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS) were used to predict the HSPs.

Split Amino Acid Composition (SAAC).
Split amino acid composition (SAAC) is a feature extraction method based on the amino acid composition (AAC). In SAAC, the protein sequence is split into several segments, and the composition of each segment is then counted separately [36][37][38][39]. It is well known that the golden ratio is ubiquitous in nature. According to the golden ratio, the protein sequence is divided into the N-terminus segment and the C-terminus segment; the ratio of the N-terminus segment to the C-terminus segment is the golden ratio [40]. This method can be represented as follows:

$$\mathrm{SAAC}_{Gr_1} = \left[\frac{W_1^{N}}{L_N}, \ldots, \frac{W_{20}^{N}}{L_N}, \frac{W_1^{C}}{L_C}, \ldots, \frac{W_{20}^{C}}{L_C}\right],$$

where $Gr_1$ is the 1-step segmentation using the golden ratio, $N$ represents the N-terminus, $C$ represents the C-terminus, $W_i$ is the occurrence of amino acid $i$, $L_N$ is the length of the N-terminus segment, and $L_C$ is the length of the C-terminus segment.
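As an illustration of the one-step golden-ratio split, the following is a minimal Python sketch rather than the implementation used in this study. It assumes that the N-terminus segment is the longer one, so that L_N / L_C is approximately 1.618, and that the split point is rounded to the nearest residue; the function names `composition` and `saac_gr1` are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PHI = (1 + 5 ** 0.5) / 2  # golden ratio, about 1.618


def composition(segment):
    """Return the 20 frequencies W_i / len(segment) for one segment."""
    counts = Counter(segment)
    return [counts[aa] / len(segment) for aa in AMINO_ACIDS]


def saac_gr1(sequence):
    """One-step golden-ratio SAAC: 20 N-terminus + 20 C-terminus values.

    Assumption: L_N / L_C is approximately PHI, so the N-terminus segment
    takes the first round(L * PHI / (1 + PHI)) residues.
    """
    n_len = round(len(sequence) * PHI / (1 + PHI))
    n_seg, c_seg = sequence[:n_len], sequence[n_len:]
    if not n_seg or not c_seg:
        raise ValueError("sequence is too short to split into two segments")
    return composition(n_seg) + composition(c_seg)


features = saac_gr1("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(len(features))  # 40 values: 20 per segment
```

Non-standard residue codes (for example X) are not counted in this sketch but still contribute to the segment length, which is one of several possible conventions.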
With this method, $\mathrm{SAAC}_{Gr_2}$, $\mathrm{SAAC}_{Gr_3}$, and so on can be obtained in the same way.

Dipeptide Composition (DC).
Dipeptide composition (DC) is a discrete method using sequence neighbor information [27,41,42]. The occurrence frequency of each pair of adjacent amino acid residues was computed; the advantage of DC is that it considers some sequence-order information. It can be calculated as follows:

$$\mathrm{DC} = \left[\frac{m_1}{L-1}, \frac{m_2}{L-1}, \ldots, \frac{m_{400}}{L-1}\right],$$

where $m_i$ is the occurrence number of the $i$th dipeptide in the protein sequence and $L$ is the length of the protein sequence.
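The dipeptide composition can be sketched in a few lines of Python. This is an illustrative example only: it assumes the common normalization by the number of overlapping dipeptides, L - 1, and an alphabetical ordering of the 400 dipeptide types; the function name `dipeptide_composition` is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 dipeptide types, in alphabetical order (AA, AC, ..., YY).
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400 values m_i / (L - 1) for a protein sequence."""
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    # Slide a window of width 2 over the sequence (overlapping dipeptides).
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs with non-standard residues
            counts[pair] += 1
    total = len(sequence) - 1  # assumed normalization: number of windows
    return [counts[d] / total for d in DIPEPTIDES]


dc = dipeptide_composition("ACAC")
print(dc[DIPEPTIDES.index("AC")])  # AC occurs in 2 of the 3 windows
```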

Conjoint Triad Feature (CTF).
The conjoint triad feature (CTF) representation was introduced by Shen et al. [43]. In this method, the properties of one amino acid and its vicinal amino acids are considered, and three continuous amino acids are regarded as a unit. The CTF has been applied to the prediction of enzyme function [44], protein-protein interactions [45], RNA-protein interactions [46], and nuclear receptors [47]. The features of CTF can be formulated as follows:

$$\mathrm{CTF} = \left[f_1, f_2, \ldots\right], \qquad f_i = \frac{n_i}{L-2},$$

where $n_i$ is the occurrence number of the $i$th triad type in the protein sequence and $L$ is the length of the protein sequence.
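A conjoint triad encoding might look like the sketch below. It assumes the seven-class grouping of amino acids commonly attributed to Shen et al. (which gives 7 x 7 x 7 = 343 triad types) and normalization by the number of overlapping triads, L - 2; neither detail is stated in the text above, and the function name `conjoint_triad` is hypothetical.

```python
from itertools import product

# Assumed seven-class grouping of the 20 amino acids (after Shen et al.).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {
    aa: str(index)
    for index, group in enumerate(CLASSES, start=1)
    for aa in group
}
# All 343 triad types written as class strings: "111", "112", ..., "777".
TRIADS = ["".join(t) for t in product("1234567", repeat=3)]


def conjoint_triad(sequence):
    """Return the 343 values n_i / (L - 2) for a protein sequence."""
    if len(sequence) < 3:
        raise ValueError("sequence must contain at least three residues")
    encoded = [AA_TO_CLASS.get(aa) for aa in sequence]  # None if non-standard
    counts = dict.fromkeys(TRIADS, 0)
    for i in range(len(encoded) - 2):
        window = encoded[i:i + 3]
        if None not in window:
            counts["".join(window)] += 1
    total = len(sequence) - 2  # assumed normalization: number of windows
    return [counts[t] / total for t in TRIADS]


ctf = conjoint_triad("AGVC")
print(len(ctf))  # 343
```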

Pseudoaverage Chemical Shift (PseACS).
Nuclear magnetic resonance (NMR) plays a unique role in studying the structure of proteins because it provides information on the dynamics of the internal motion of proteins on multiple time scales [48]. Protons are sensitive to the chemical environment. Protons in different chemical environments experience slightly different magnetic fields and therefore absorb at different frequencies; the resonant frequency of a proton relative to a standard is called the chemical shift [49]. As an important parameter measured by NMR spectroscopy, the chemical shift has been used as a powerful indicator of protein structure. Several researchers revealed that the averaged chemical shift (ACS) of a particular nucleus in the protein backbone empirically correlates well with its secondary structure [50]. The PseACS web server is accessible at http://202.207.14.87:8032/bioinformation/acACS/index.asp. For a protein P, each amino acid in the sequence is substituted by its averaged chemical shift for each nucleus i, where 15N stands for nitrogen, 13Cα for alpha carbon, 1Hα for alpha hydrogen, and 1HN for hydrogen linked with nitrogen. The PseACS was then computed with λ = 54 and i = 15N, 13Cα, 1Hα, 1HN.

2.4. Synthetic Minority Oversampling Technique (SMOTE). As shown in Table 1, the number of HSP40 sequences is about 4 times, 8 times, 5 times, 22 times, and 15 times that of HSP20, HSP60, HSP70, HSP90, and HSP100, respectively. This leads to an imbalanced data classification problem. To overcome this problem, we used SMOTE. SMOTE is an oversampling approach in which the minority class is oversampled by taking each minority class sample and creating new synthetic samples along the line segments connecting it to any or all of its K-Nearest Neighbors that belong to the same class [51,52].
In this paper, SMOTE was used to equalize the number of proteins across the six families.
The algorithm was run in the Weka software. The SMOTE filter was selected when the data were loaded, with the default parameters. The five smaller families were oversampled in turn, from the smallest to the largest, until each reached the size of HSP40, the largest of the HSP families.
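The interpolation step of SMOTE can be illustrated with a small pure-Python sketch. This is not the Weka filter used in this study; it is a simplified version under stated assumptions (Euclidean distance, k = 5 neighbors by default, a base sample drawn at random for each synthetic point), and the names `smote` and `to_add` are hypothetical. The family sizes are those listed for subsets S1 to S6, matched to the families through the ratios given above.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic vectors from a list of minority-class vectors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbors of the base sample within the same class.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Number of synthetic samples each family would need to match HSP40.
family_sizes = {"HSP20": 357, "HSP40": 1279, "HSP60": 163,
                "HSP70": 283, "HSP90": 58, "HSP100": 85}
largest = max(family_sizes.values())
to_add = {family: largest - size for family, size in family_sizes.items()}
print(to_add["HSP90"])  # 1221 synthetic HSP90 samples would be required
```

Because every synthetic point lies on a segment between two real samples of the same family, the new points stay inside the region already occupied by that family in feature space.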

Support Vector Machine (SVM).
The support vector machine (SVM) is a machine learning algorithm based on statistical learning theory. The basic idea of SVM is to transform the input data into a high-dimensional Hilbert space and then determine the optimal separating hyperplane [53,54]. The radial basis function (RBF) kernel was used to obtain the classification hyperplane because of its effectiveness and speed in the training process. The regularization parameter c and the kernel width parameter γ were determined via the grid search method. To handle a multiclass problem, the "one-versus-one (OVO)" and "one-versus-rest (OVR)" methods are generally applied to extend the traditional SVM. In this study, the "OVO" strategy was used. The OVO strategy constructs k × (k − 1)/2 classifiers, each trained with the data from two different classes. SVM has been successfully applied in the field of computational biology and bioinformatics [55][56][57][58][59][60][61][62][63][64]. In this paper, the LibSVM package was used to predict HSPs; it can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvm.
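The one-versus-one scheme can be sketched independently of any SVM library. The example below only illustrates how the k × (k − 1)/2 pairwise classifiers are enumerated and how their votes are combined; the toy pairwise classifiers (nearest center on a one-dimensional feature) are stand-ins chosen for illustration, not the RBF-kernel SVMs trained in this study, and all names are hypothetical.

```python
from collections import Counter
from itertools import combinations

FAMILIES = ["HSP20", "HSP40", "HSP60", "HSP70", "HSP90", "HSP100"]


def ovo_pairs(classes):
    """Enumerate the k * (k - 1) / 2 class pairs used by the OVO strategy."""
    return list(combinations(classes, 2))


def ovo_predict(sample, pairwise_classifiers):
    """Majority vote over pairwise classifiers.

    pairwise_classifiers maps a pair (a, b) to a callable that returns
    either a or b for a sample. Ties go to the label counted first.
    """
    votes = Counter(clf(sample) for clf in pairwise_classifiers.values())
    return votes.most_common(1)[0][0]


# Toy stand-in for trained binary SVMs: each family has a center on a line,
# and a pairwise classifier picks the family with the closer center.
centers = {family: float(index) for index, family in enumerate(FAMILIES)}
classifiers = {
    (a, b): (lambda x, a=a, b=b:
             a if abs(x - centers[a]) <= abs(x - centers[b]) else b)
    for a, b in ovo_pairs(FAMILIES)
}

print(len(classifiers))               # 15 pairwise classifiers for k = 6
print(ovo_predict(2.2, classifiers))  # HSP60, whose center (2.0) is closest
```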

Performance Evaluation.
In statistical prediction, three cross-validation tests are commonly used to examine a predictor for its effectiveness in practical application: the k-fold cross-validation (subsampling test), the independent dataset test, and the jackknife test. Among the three methods, the jackknife test is deemed the most objective and rigorous one. In the jackknife test, each sample in the training dataset is in turn singled out as an independent test sample, and all the rule parameters are calculated based on the remaining dataset without including the one being identified. Hence, the jackknife test was used to evaluate performance in this paper. To evaluate the predictive capability and reliability of our model, the performance of the classification algorithm was measured using sensitivity (Sn), specificity (Sp), accuracy (Acc), Matthews correlation coefficient (MCC), and overall accuracy (OA) [65][66][67][68][69][70][71][72][73][74][75]:

$$\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \qquad \mathrm{OA} = \frac{1}{N}\sum_{i=1}^{m} TP_i,$$

where TP represents the true positives, TN the true negatives, FP the false positives, and FN the false negatives; $TP_i$ is the number of true positives for the $i$th family, m = 6 is the number of subsets, and N is the total number of sequences of the HSP families.
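These measures can be computed from a multiclass confusion matrix by treating each family in turn as the positive class. The sketch below uses the standard definitions of Sn, Sp, Acc, and MCC; the layout of the confusion matrix (rows are true classes, columns are predicted classes), the zero value returned when a denominator vanishes, and the function names are assumptions made for illustration.

```python
import math


def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def per_class_metrics(tp, tn, fp, fn):
    """Sn, Sp, Acc, and MCC for one family treated as the positive class."""
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "Sn": safe_div(tp, tp + fn),
        "Sp": safe_div(tn, tn + fp),
        "Acc": safe_div(tp + tn, tp + tn + fp + fn),
        "MCC": safe_div(tp * tn - fp * fn, mcc_denominator),
    }


def evaluate(confusion):
    """confusion[i][j] = number of class-i samples predicted as class j."""
    m = len(confusion)
    total = sum(sum(row) for row in confusion)
    results = []
    for i in range(m):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp
        fp = sum(confusion[r][i] for r in range(m)) - tp
        tn = total - tp - fn - fp
        results.append(per_class_metrics(tp, tn, fp, fn))
    overall_accuracy = safe_div(sum(confusion[i][i] for i in range(m)), total)
    return results, overall_accuracy


# Small two-class example with made-up counts, for illustration only.
metrics, oa = evaluate([[8, 2], [1, 9]])
print(metrics[0]["Sn"], metrics[0]["Sp"], oa)  # 0.8 0.9 0.85
```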

Results and Discussion
3.1. The Predictive Performance of HSPs. In order to investigate the effectiveness of the predictive model, many feature parameters were tested [76,77]. Then, the split amino acid composition (SAAC), the dipeptide composition (DC), the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS) were selected to predict the HSPs. Table 3 lists the predictive performance of HSPs using individual features with the SVM classification algorithm without SMOTE; the highest overall accuracy (OA) of an individual parameter is 91.38% with the jackknife test by using PseACS. Individual features identify the families of HSPs with an overall accuracy (OA) ranging from 80.92% to 91.38%. Figure 2 shows the predictive results of different combined features of HSPs with SVM without SMOTE. The results show that the combined feature of SAAC+DC+CTF+PseACS was better than the other parameters. The overall accuracy (OA) of the combined feature of SAAC+DC+CTF+PseACS was 94.91% with the jackknife test. This result indicated that the combined feature was powerful in predicting HSPs. Table 4 lists the predictive performance of HSP families using the optimized combination feature SAAC+DC+CTF+PseACS with and without SMOTE. In the models with SMOTE, the Sn, Sp, Acc, and MCC of HSP families improved remarkably. For example, for HSP20 with SMOTE, Sn = 100%, Sp = 99.92%, MCC = 1, and Acc = 99.93%, which are 5.65%, 1.34%, 0.08, and 2.04% higher than those without SMOTE. In addition, OA = 99.72% with SMOTE, which is 4.81% higher than that without SMOTE. The results indicate that the combined parameter SAAC+DC+CTF+PseACS with SMOTE was helpful in enhancing predictive performance.

Figure 4: The predictive overall accuracy of HSPs by using four algorithms.

Comparison with Other Algorithms. The predictive performance of our predictive model (SVM), Random Forest (RF) [78], Naive Bayes (NB), and K-Nearest Neighbors (KNN) [79] is shown in Figures 3 and 4. From Figure 3, we can see that the Sn, Sp, MCC, and Acc of the HSP families differ markedly among the algorithms. The Sn of HSP60, HSP70, HSP90, and HSP100 using SVM and KNN were all 100%. The Sp of HSP20 using KNN and SVM were similar, and the Sp of HSP40 using SVM and KNN were 100%. The MCC of HSP20 and HSP90 using SVM and KNN were both 1. The Acc of HSP20 using KNN and SVM were similar. In addition, from Figure 4, we can see that the value of OA with SVM was 99.72%, which was 4.39%, 7.07%, and 18.99% higher than RF, KNN, and NB, respectively. The highest value of the other parameters was obtained by SVM. Therefore, the experimental results show that SVM achieved the best measures.

Figure 5 shows the predictive performance of HSP families using independent datasets. In the HGNC independent dataset, the OA of our predictive model was 98.96%, which was 11.60% and 11.46% higher than PredHSP and ir-HSP, respectively. In the RICE independent dataset, the OA of our predictive model reached 99.31%, which was 4.76% and 2.95% higher than PredHSP and ir-HSP, respectively. From this comparison, we conclude that the applicability and accuracy of our prediction model for HSP prediction were improved.
3.3. Comparison with Existing Methods. In order to evaluate the performance of our predictive model, we made comparisons with existing methods. The method developed by Ahmad et al. did not provide any family-wise accuracy of HSPs, so we compared the effectiveness with iHSP-PseRAAAC, PredHSP, and ir-HSP. The results of the comparisons are shown in Table 5. We can see that the Sn, Sp, Acc, and MCC of HSP families in our predictive model were higher than those of PredHSP, iHSP-PseRAAAC, and ir-HSP. For example, in our predictive model, Sn = 100%, Sp = 99.92%, MCC = 1, and Acc = 99.93% for HSP20 exceeded those of ir-HSP, PredHSP, and iHSP-PseRAAAC. In addition, in our predictive model, Sn = 100% for all HSP families except HSP40, for which Sn = 98.33%. Furthermore, the overall accuracy was 99.72% in our predictive model. These results indicate that our predictive model was superior to existing methods.

Conclusion
In this work, an optimized classifier for HSP family identification was developed. This model was derived from the SVM machine learning algorithm, and SMOTE was used to address the imbalanced data classification problem. The overall accuracy was 99.72% with the balanced dataset and the jackknife test by using the optimized combination feature SAAC+DC+CTF+PseACS. The high overall accuracy indicates that our predictive model is a reliable tool for HSP family prediction. It is known that HSP expression is associated with human diseases, and these families of HSPs have different functions. Therefore, our predictive model may benefit researchers by quickly and effectively identifying HSP families, which may in turn support the design of new drugs for treating diseases.

Data Availability
The data used to support the findings of this study are available from the supplementary materials.

Conflicts of Interest
The authors declare that there is no conflict of interest.