
The influence of dataset homology and a rigorous evaluation strategy on protein secondary structure prediction

Fig 5

Effects of the homology reduction of the PSSM reference dataset on the accuracy of secondary structure prediction.

(A) The layout of datasets. The homology of the PSSM reference dataset was reduced with a series of sequence identity cutoffs; the lowest cutoff was 30% because applying 25% or 20% would leave too few sequences to sustain the required dataset sizes (Fig 1A). The homology between the training/testing query sets and the reference dataset was held below 20% sequence identity. The inter- and intra-dataset identity cutoffs of the training and testing query datasets were both 90%, set high to preserve sufficient sequences. (B) The secondary structure prediction (SSP) accuracy obtained at different homology levels of the reference dataset. Overfitting in training and testing was largely suppressed because the query-reference homology was fixed at a low level. More importantly, accuracy increased as the homology of the reference sequences was lowered. The same conclusion applied to SOV (see S5 Fig). This phenomenon had not been clearly reported before our study. See Fig 6 for advanced tests.
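
The dataset layout in panel (A) can be illustrated with a minimal sketch of greedy homology reduction at a sequence identity cutoff. This is not the authors' pipeline: the caption does not name the tools used, and the function names `toy_identity`, `reduce_homology`, and `filter_against_reference` are hypothetical. In particular, `toy_identity` is a crude position-by-position stand-in for an alignment-based identity, chosen only so the example runs on its own.

```python
from typing import Callable, List


def toy_identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence.

    Assumption: an ungapped, position-wise comparison used purely for
    illustration; real pipelines compute identity from an alignment.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / n


def reduce_homology(
    seqs: List[str],
    cutoff: float,
    identity: Callable[[str, str], float] = toy_identity,
) -> List[str]:
    """Greedily keep sequences whose identity to every kept one is below cutoff."""
    kept: List[str] = []
    for s in seqs:
        if all(identity(s, k) < cutoff for k in kept):
            kept.append(s)
    return kept


def filter_against_reference(
    queries: List[str],
    reference: List[str],
    cutoff: float,
    identity: Callable[[str, str], float] = toy_identity,
) -> List[str]:
    """Keep queries whose identity to every reference sequence is below cutoff."""
    return [q for q in queries if all(identity(q, r) < cutoff for r in reference)]


# Toy data, not taken from the paper.
reference_pool = ["AAAA", "AAAC", "CCCC", "GGGG"]
reduced_reference = reduce_homology(reference_pool, cutoff=0.5)
queries = filter_against_reference(["AAAC", "TTTT"], ["AAAA"], cutoff=0.2)
```

In this sketch a lower `cutoff` passed to `reduce_homology` yields a smaller, less redundant reference set, while `filter_against_reference` plays the role of the fixed low query-reference threshold described in the caption.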


doi: https://doi.org/10.1371/journal.pone.0254555.g005