SulSite-GTB: identification of protein S-sulfenylation sites by fusing multiple feature information and gradient tree boosting

  • Original Article
Neural Computing and Applications

Abstract

Protein cysteine S-sulfenylation is an essential and reversible post-translational modification that plays a crucial role in transcriptional regulation, stress response, cell signaling and protein function. Studies have shown that S-sulfenylation is involved in many human diseases, such as cancer, diabetes and arteriosclerosis. However, experimental identification of protein S-sulfenylation sites is generally expensive and time-consuming. In this study, we propose SulSite-GTB, a new method for predicting protein S-sulfenylation sites. First, seven feature extraction methods are fused to obtain the initial feature space: amino acid composition, dipeptide composition, encoding based on grouped weight, K nearest neighbors, position-specific amino acid propensity, position-weighted amino acid composition and pseudo-position specific score matrix. Second, the synthetic minority oversampling technique (SMOTE) is used to handle class imbalance, and the least absolute shrinkage and selection operator (LASSO) is employed to remove redundant and irrelevant features. Finally, the optimal feature subset is input into a gradient tree boosting classifier to predict S-sulfenylation sites, and five-fold cross-validation and an independent test set are used to evaluate prediction performance. The overall prediction accuracy is 92.86% on the training set and 88.53% on the independent test set, with AUC values of 0.9706 and 0.9425, respectively. Comparisons show that SulSite-GTB is significantly superior to other state-of-the-art methods, and it provides a new idea for predicting post-translational modification sites of other proteins. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/SulSite-GTB/.
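To make the first two steps of the abstract concrete, the following is a minimal sketch of two of the seven encodings (amino acid composition and dipeptide composition) fused into one vector, followed by a simplified SMOTE-style interpolation between minority samples. It is not the authors' implementation, which is available in the repository linked above. The function names (`aac`, `dpc`, `fuse`, `smote_like`), the window length and the example peptides are all assumptions made for illustration, and the remaining five encodings, LASSO selection and the gradient tree boosting classifier are not shown.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def aac(peptide):
    """Amino acid composition: frequency of each residue (20 values)."""
    n = len(peptide)
    return [peptide.count(a) / n for a in AMINO_ACIDS]


def dpc(peptide):
    """Dipeptide composition: frequency of each adjacent pair (400 values)."""
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    total = len(pairs)
    return [pairs.count(d) / total for d in DIPEPTIDES]


def fuse(peptide):
    """Concatenate both encodings into a single 420-dimensional vector."""
    return aac(peptide) + dpc(peptide)


def smote_like(minority, n_new, k=2, seed=0):
    """Simplified SMOTE: build n_new synthetic samples, each interpolated
    between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation coefficient in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Hypothetical cysteine-centred windows, invented for illustration only.
positives = ["AKLECGHIV", "MTRSCPLKE", "GGDACWYQN"]
features = [fuse(p) for p in positives]
extra = smote_like(features, n_new=2)
```

Because each synthetic vector is a convex combination of two real feature vectors, its composition entries still sum to one, which is one reason interpolation-based oversampling suits frequency-style encodings like these.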

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61863010, 11771188), the Key Research and Development Program of Shandong Province of China (No. 2019GGX101001), and the Natural Science Foundation of Shandong Province of China (No. ZR2018MC007). This work used the Extreme Science and Engineering Discovery Environment, which is supported by the National Science Foundation (No. ACI-1548562).

Author information

Corresponding author

Correspondence to Bin Yu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Wang, M., Cui, X., Yu, B. et al. SulSite-GTB: identification of protein S-sulfenylation sites by fusing multiple feature information and gradient tree boosting. Neural Comput & Applic 32, 13843–13862 (2020). https://doi.org/10.1007/s00521-020-04792-z

  • DOI: https://doi.org/10.1007/s00521-020-04792-z
