Using random forest algorithm to predict super-secondary structure in proteins

Hu, Xiu-zhen; Long, Hai-xia; Ding, Chang-jiang; Gao, Su-juan; Hou, Rui

doi:10.1007/s11227-018-2531-2

Using random forest algorithm to predict super-secondary structure in proteins

Published: 17 August 2018

Volume 76, pages 3199–3210, (2020)
Cite this article

The Journal of Supercomputing Aims and scope Submit manuscript

Xiu-zhen Hu¹,
Hai-xia Long^1,2,
Chang-jiang Ding¹,
Su-juan Gao¹ &
…
Rui Hou¹

245 Accesses
5 Citations
Explore all metrics

Abstract

A new super-secondary structure dataset with sequence identity < 25%, resolution better than 2.0 Å, which contained 2805 non-redundant protein chains and could be classified into four motifs, is built for prediction. The matrix scoring values and hydropathy distribution are extracted from the protein sequences, which are then input to the random forest algorithm to predict the super-secondary structure in proteins. The predictive overall accuracy is 72.71% and 68.54% by fivefold cross-validation and independent dataset test, respectively. The proposed method is also tested on a previous independent test dataset and the predictive overall accuracy 84.89%, which is better than the performance of the previous predictions.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A two-stage approach towards protein secondary structure classification

Article 29 May 2020

Ranking near-native candidate protein structures via random forest classification

Article Open access 24 December 2019

Predicting the Outer/Inner BetaStrands in Protein Beta Sheets Based on the Random Forest Algorithm

References

Cao XY, Hu XZ, Zhang XJ, Gao SJ, Ding CJ, Feng YE, Bao WH (2017) Identification of metal ion binding sites based on amino acid sequences. PLoS ONE 12(8):e0183756
Article Google Scholar
Conde L, Vaquerizas JM, Dopazo H, Arbiza L, Reumers J, Rousseau F et al (2006) PupaSuite: finding functional single nucleotide polymorphisms for large-scale genotyping purposes. Nucleic Acids Res 34(34):621–625
Article Google Scholar
Levy R, Sobolev V, Edelman M (2011) First and second shell metal binding residues in human proteins are disproportionately associated with disease related SNPs. Hum Mutat 32(11):1309–1318
Article Google Scholar
Gurunath R, Beena TK, Adiga PR, Balaram P (1995) Enhancing peptide antigenicity by helix stabilization. FEBS Lett 361:176–178
Article Google Scholar
Sun ZR, Rao XQ, Peng LW, Xu D (1997) Prediction of protein supersecondary structures based on the artificial neural network method. Protein Eng 10:763–769
Article Google Scholar
Hu XZ, Li QZ (2006) The protein super-secondary structure recognition with the method of diversity measure. Acta Biophysica Sinica 13(6):424–428
Google Scholar
Hu XZ, Li QZ (2008) Prediction of the β-Hairpins in protein using support vector machine. Protein J 27:115–122
Article Google Scholar
Zou DH, He ZS, He JY, Xia YX (2011) Supersecondary structure prediction using chou’s pseudo amino acid composition. J Comput Chem 32:271–278
Article Google Scholar
Mao WZ, Wang T, Zhang W, Gong H (2018) Identification of residue pairing in interacting β-strands from a predicted residue contact map. BMC Bioinform 146:1–19
Google Scholar
Kabsch W, Sander C (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen- bonded and geometrical features. Biopolymers 22:2577–2637
Article Google Scholar
Oliva B, Bates PA, Querol E, Avilés FX, Sternberg MJ (1997) An automated classification of the StrucT of protein loops. J Mol Biol 266:814–830
Article Google Scholar
Espadaler J, Fuentes NF, Hermoso A, Querol E, Aviles FX, Sternberg MJE, Oliva B (2004) ArchDB: automated protein loop classification as a tool for structural genomics. Nucleic Acids Res. 32:185–188
Article Google Scholar
Jaume B, Joan PI, Javier GG, Manuel A, Narcis ML, Fuentes F, Oliva B (2014) ArchDB 2014: structural classification of loops in proteins. Nucleic Acids Res 42:315–319
Article Google Scholar
Leo B (2001) Random Forest. Statistics. Department University of California Berkeley, CA, vol 94720, pp 1–2
Li C, Wang XF, Chen Z, Zhang ZD, Song JN (2015) Computational characterization of parallel dimeric and trimeric coiled-coils using effective amino acid indices. Mol Biosyst 11(2):354–360
Article Google Scholar
Song JN, Li FY, Takemoto K, Haffari G, Akutsu T, Chou KC, Webb GI (2018) PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework. J Theor Biol 443:125–137
Article Google Scholar
Okun O, Priisalu H (2007) Random forest for gene expression based cancer classification: overlooked issues. Pattern Recognit Image Anal 4478:483–490
Article Google Scholar
Jia SC, Hu XZ (2011) Using random forest algorithm to predict β-hairpin motifs. Protein Pept Lett 18(6):609–617
Article Google Scholar
Richa T, Ide S, Suzuki R, Ebina T, Kuroda Y (2017) Fast H-DROP: a thirty times accelerated version of H-DROP for interactive SVM-based prediction of helical domain linkers. J Comput Aided Mol Des 31(2):237–244
Article Google Scholar
Liu ZP, Wu LY, Wang Y, Zhang XS, Chen LN (2010) Prediction of protein–RNA binding sites by a random forest method with combined features. Bioinformatics 26(13):1616–1622
Article Google Scholar
Pánek J, Eidhammer I, Aasland R (2005) A new method for identification of protein (sub) families in a set of proteins based on hydropathy distribution in proteins. Proteins Struct Funct Bioinform 58:923–934
Article Google Scholar
Kel AE, GoBling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E (2003) MATCHTM: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acid Res 31(13):3576–3579
Article Google Scholar
Quandt K, Frech K, Karas H, Wingender E, Werner T (1995) MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res 23:4878–4884
Article Google Scholar
Cartharius K, Frech K, Grote K, Klocke B, Haltmeier M, Klingenhoff A, Frisch M, Bayerlein M, Werner T (2005) MatInspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics 13:2933–2942
Article Google Scholar
Kumar M, Bhasin M, Natt NK, Raghava GPS (2005) BhairPred: prediction of b-hairpins in a protein from multiple alignment information using ANN and SVM techniques. Nucleic Acids Res 33:154–155
Article Google Scholar
Kuhn M, Meiler J, Baker D (2004) Strand-loop-strand motifs: prediction of hairpins and diverging turns in proteins. Proteins Struct Funct Bioinform 54:282–288
Article Google Scholar

Download references

Acknowledgements

This work was supported by National Natural Science Foundation of China (31260203, 51467015) and Natural Science Foundation of the Inner Mongolia of China (2016MS0378).

Author information

Authors and Affiliations

College of Sciences, Inner Mongolia University of Technology, Hohhot, 010051, People’s Republic of China
Xiu-zhen Hu, Hai-xia Long, Chang-jiang Ding, Su-juan Gao & Rui Hou
College of Physics and Electronic Information, Hulunbuir University, Hulunbuir, 021008, People’s Republic of China
Hai-xia Long

Authors

Xiu-zhen Hu
View author publications
You can also search for this author in PubMed Google Scholar
Hai-xia Long
View author publications
You can also search for this author in PubMed Google Scholar
Chang-jiang Ding
View author publications
You can also search for this author in PubMed Google Scholar
Su-juan Gao
View author publications
You can also search for this author in PubMed Google Scholar
Rui Hou
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xiu-zhen Hu.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Hu, Xz., Long, Hx., Ding, Cj. et al. Using random forest algorithm to predict super-secondary structure in proteins. J Supercomput 76, 3199–3210 (2020). https://doi.org/10.1007/s11227-018-2531-2

Download citation

Published: 17 August 2018
Issue Date: May 2020
DOI: https://doi.org/10.1007/s11227-018-2531-2

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Using random forest algorithm to predict super-secondary structure in proteins

Abstract

Access this article

Similar content being viewed by others

A two-stage approach towards protein secondary structure classification

Ranking near-native candidate protein structures via random forest classification

Predicting the Outer/Inner BetaStrands in Protein Beta Sheets Based on the Random Forest Algorithm

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Using random forest algorithm to predict super-secondary structure in proteins

Abstract

Access this article

Similar content being viewed by others

A two-stage approach towards protein secondary structure classification

Ranking near-native candidate protein structures via random forest classification

Predicting the Outer/Inner BetaStrands in Protein Beta Sheets Based on the Random Forest Algorithm

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation