SulSite-GTB: identification of protein S-sulfenylation sites by fusing multiple feature information and gradient tree boosting

Wang, Minghui; Cui, Xiaowen; Yu, Bin; Chen, Cheng; Ma, Qin; Zhou, Hongyan

doi:10.1007/s00521-020-04792-z

Original Article
Published: 13 March 2020

Volume 32, pages 13843–13862 (2020)

Neural Computing and Applications

Minghui Wang^1,2,
Xiaowen Cui^1,2,
Bin Yu^1,2,3 (ORCID: orcid.org/0000-0002-2453-7852),
Cheng Chen^1,2,
Qin Ma^4 &
Hongyan Zhou^1,2

Abstract

Protein cysteine S-sulfenylation is an essential and reversible post-translational modification that plays a crucial role in transcriptional regulation, stress response, cell signaling and protein function. Studies have shown that S-sulfenylation is involved in many human diseases, such as cancer, diabetes and arteriosclerosis. However, experimental identification of protein S-sulfenylation sites is generally expensive and time-consuming. In this study, we propose SulSite-GTB, a new method for predicting protein S-sulfenylation sites. First, seven feature extraction methods are fused to obtain the initial feature space: amino acid composition, dipeptide composition, encoding based on grouped weight, K nearest neighbors, position-specific amino acid propensity, position-weighted amino acid composition and pseudo-position specific score matrix. Second, the synthetic minority oversampling technique (SMOTE) is used to handle the class imbalance in the data, and the least absolute shrinkage and selection operator (LASSO) is employed to remove redundant and irrelevant features. Finally, the optimal feature subset is input into a gradient tree boosting classifier to predict S-sulfenylation sites, and five-fold cross-validation and an independent test set are used to evaluate the prediction performance of the model. The overall prediction accuracy is 92.86% on the training set and 88.53% on the independent test set, with AUC values of 0.9706 and 0.9425, respectively. Compared with other state-of-the-art prediction methods, SulSite-GTB performs significantly better, and it offers a new approach for predicting other protein post-translational modification sites. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/SulSite-GTB/.
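The first two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the repository linked above). It fuses only two of the seven encodings, amino acid composition and dipeptide composition, and then applies a simplified SMOTE-style interpolation to the minority class. The function names (`aac`, `dpc`, `fuse`, `smote`), the example peptides and the parameter values are all hypothetical, chosen only for illustration.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def aac(peptide):
    """Amino acid composition: relative frequency of each standard residue."""
    residues = [r for r in peptide if r in AMINO_ACIDS]
    n = len(residues) or 1  # avoid division by zero on empty input
    return [residues.count(a) / n for a in AMINO_ACIDS]


def dpc(peptide):
    """Dipeptide composition: relative frequency of each adjacent residue pair."""
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    n = len(valid) or 1
    return [valid.count(d) / n for d in DIPEPTIDES]


def fuse(peptide):
    """Feature fusion by concatenation: 20 AAC values followed by 400 DPC values."""
    return aac(peptide) + dpc(peptide)


def smote(minority, n_new, k=5, seed=0):
    """Simplified SMOTE: interpolate between a minority sample and one of its
    k nearest minority neighbours (squared Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [m for m in minority if m is not x]
        neighbours = sorted(
            others, key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m))
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Hypothetical positive (minority) peptide windows, for illustration only.
positives = ["AACC", "ACCA", "CCAA"]
minority = [fuse(p) for p in positives]
augmented = minority + smote(minority, n_new=4, k=2)
```

In a full pipeline, the balanced feature matrix would then go through LASSO-based feature selection and a gradient tree boosting classifier evaluated by five-fold cross-validation. Those steps are omitted here because they depend on third-party libraries.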

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61863010 and 11771188), the Key Research and Development Program of Shandong Province of China (No. 2019GGX101001), and the Natural Science Foundation of Shandong Province of China (No. ZR2018MC007). This work used the Extreme Science and Engineering Discovery Environment, which is supported by the National Science Foundation (No. ACI-1548562).

Author information

Authors and Affiliations

1. College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
Minghui Wang, Xiaowen Cui, Bin Yu, Cheng Chen & Hongyan Zhou
2. Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
Minghui Wang, Xiaowen Cui, Bin Yu, Cheng Chen & Hongyan Zhou
3. School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China
Bin Yu
4. Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
Qin Ma

Corresponding author

Correspondence to Bin Yu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, M., Cui, X., Yu, B. et al. SulSite-GTB: identification of protein S-sulfenylation sites by fusing multiple feature information and gradient tree boosting. Neural Comput & Applic 32, 13843–13862 (2020). https://doi.org/10.1007/s00521-020-04792-z

Received: 05 February 2019
Accepted: 17 February 2020
Published: 13 March 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s00521-020-04792-z
