Skip to main content
Log in

Using random forest algorithm to predict super-secondary structure in proteins

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

A new super-secondary structure dataset with sequence identity < 25%, resolution better than 2.0 Å, which contained 2805 non-redundant protein chains and could be classified into four motifs, is built for prediction. The matrix scoring values and hydropathy distribution are extracted from the protein sequences, which are then input to the random forest algorithm to predict the super-secondary structure in proteins. The predictive overall accuracy is 72.71% and 68.54% by fivefold cross-validation and independent dataset test, respectively. The proposed method is also tested on a previous independent test dataset and the predictive overall accuracy 84.89%, which is better than the performance of the previous predictions.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2

Similar content being viewed by others

References

  1. Cao XY, Hu XZ, Zhang XJ, Gao SJ, Ding CJ, Feng YE, Bao WH (2017) Identification of metal ion binding sites based on amino acid sequences. PLoS ONE 12(8):e0183756

    Article  Google Scholar 

  2. Conde L, Vaquerizas JM, Dopazo H, Arbiza L, Reumers J, Rousseau F et al (2006) PupaSuite: finding functional single nucleotide polymorphisms for large-scale genotyping purposes. Nucleic Acids Res 34(34):621–625

    Article  Google Scholar 

  3. Levy R, Sobolev V, Edelman M (2011) First and second shell metal binding residues in human proteins are disproportionately associated with disease related SNPs. Hum Mutat 32(11):1309–1318

    Article  Google Scholar 

  4. Gurunath R, Beena TK, Adiga PR, Balaram P (1995) Enhancing peptide antigenicity by helix stabilization. FEBS Lett 361:176–178

    Article  Google Scholar 

  5. Sun ZR, Rao XQ, Peng LW, Xu D (1997) Prediction of protein supersecondary structures based on the artificial neural network method. Protein Eng 10:763–769

    Article  Google Scholar 

  6. Hu XZ, Li QZ (2006) The protein super-secondary structure recognition with the method of diversity measure. Acta Biophysica Sinica 13(6):424–428

    Google Scholar 

  7. Hu XZ, Li QZ (2008) Prediction of the β-Hairpins in protein using support vector machine. Protein J 27:115–122

    Article  Google Scholar 

  8. Zou DH, He ZS, He JY, Xia YX (2011) Supersecondary structure prediction using chou’s pseudo amino acid composition. J Comput Chem 32:271–278

    Article  Google Scholar 

  9. Mao WZ, Wang T, Zhang W, Gong H (2018) Identification of residue pairing in interacting β-strands from a predicted residue contact map. BMC Bioinform 146:1–19

    Google Scholar 

  10. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen- bonded and geometrical features. Biopolymers 22:2577–2637

    Article  Google Scholar 

  11. Oliva B, Bates PA, Querol E, Avilés FX, Sternberg MJ (1997) An automated classification of the StrucT of protein loops. J Mol Biol 266:814–830

    Article  Google Scholar 

  12. Espadaler J, Fuentes NF, Hermoso A, Querol E, Aviles FX, Sternberg MJE, Oliva B (2004) ArchDB: automated protein loop classification as a tool for structural genomics. Nucleic Acids Res. 32:185–188

    Article  Google Scholar 

  13. Jaume B, Joan PI, Javier GG, Manuel A, Narcis ML, Fuentes F, Oliva B (2014) ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res 42:315–319

    Article  Google Scholar 

  14. Leo B (2001) Random Forest. Statistics. Department University of California Berkeley, CA, vol 94720, pp 1–2

  15. Li C, Wang XF, Chen Z, Zhang ZD, Song JN (2015) Computational characterization of parallel dimeric and trimeric coiled-coils using effective amino acid indices. Mol Biosyst 11(2):354–360

    Article  Google Scholar 

  16. Song JN, Li FY, Takemoto K, Haffari G, Akutsu T, Chou KC, Webb GI (2018) PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework. J Theor Biol 443:125–137

    Article  Google Scholar 

  17. Okun O, Priisalu H (2007) Random forest for gene expression based cancer classification: overlooked issues. Pattern Recognit Image Anal 4478:483–490

    Article  Google Scholar 

  18. Jia SC, Hu XZ (2011) Using random forest algorithm to predict β-hairpin motifs. Protein Pept Lett 18(6):609–617

    Article  Google Scholar 

  19. Richa T, Ide S, Suzuki R, Ebina T, Kuroda Y (2017) Fast H-DROP: a thirty times accelerated version of H-DROP for interactive SVM-based prediction of helical domain linkers. J Comput Aided Mol Des 31(2):237–244

    Article  Google Scholar 

  20. Liu ZP, Wu LY, Wang Y, Zhang XS, Chen LN (2010) Prediction of protein–RNA binding sites by a random forest method with combined features. Bioinformatics 26(13):1616–1622

    Article  Google Scholar 

  21. Pánek J, Eidhammer I, Aasland R (2005) A new method for identification of protein (sub) families in a set of proteins based on hydropathy distribution in proteins. Proteins Struct Funct Bioinform 58:923–934

    Article  Google Scholar 

  22. Kel AE, GoBling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E (2003) MATCHTM: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acid Res 31(13):3576–3579

    Article  Google Scholar 

  23. Quandt K, Frech K, Karas H, Wingender E, Werner T (1995) MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res 23:4878–4884

    Article  Google Scholar 

  24. Cartharius K, Frech K, Grote K, Klocke B, Haltmeier M, Klingenhoff A, Frisch M, Bayerlein M, Werner T (2005) MatInspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics 13:2933–2942

    Article  Google Scholar 

  25. Kumar M, Bhasin M, Natt NK, Raghava GPS (2005) BhairPred: prediction of b-hairpins in a protein from multiple alignment information using ANN and SVM techniques. Nucleic Acids Res 33:154–155

    Article  Google Scholar 

  26. Kuhn M, Meiler J, Baker D (2004) Strand-loop-strand motifs: prediction of hairpins and diverging turns in proteins. Proteins Struct Funct Bioinform 54:282–288

    Article  Google Scholar 

Download references

Acknowledgements

This work was supported by National Natural Science Foundation of China (31260203, 51467015) and Natural Science Foundation of the Inner Mongolia of China (2016MS0378).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Xiu-zhen Hu.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Hu, Xz., Long, Hx., Ding, Cj. et al. Using random forest algorithm to predict super-secondary structure in proteins. J Supercomput 76, 3199–3210 (2020). https://doi.org/10.1007/s11227-018-2531-2

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11227-018-2531-2

Keywords

Navigation