Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition

https://doi.org/10.1016/j.bbrc.2005.09.117

Abstract

The nucleus is the brain of the eukaryotic cell, guiding its life processes by issuing key instructions. An in-depth understanding of the biochemical processes of the nucleus requires knowledge of where nuclear proteins are localized. With the avalanche of protein sequences generated in the post-genomic era, it is highly desirable to develop an automated method for rapidly annotating the subnuclear locations of the many newly found nuclear protein sequences, so that they can be used in a timely manner for basic research and drug discovery. In view of this, a novel approach is developed for predicting protein subnuclear location. It introduces a powerful classifier, the optimized evidence-theoretic K-nearest classifier (OET-KNN), and represents protein samples with the pseudo amino acid composition [K.C. Chou, PROTEINS: Structure, Function, and Genetics, 43 (2001) 246], which can incorporate a considerable amount of sequence-order effects. As a demonstration, identifications were performed for 370 nuclear proteins among the following 9 subnuclear locations: (1) Cajal body, (2) chromatin, (3) heterochromatin, (4) nuclear diffuse, (5) nuclear pore, (6) nuclear speckle, (7) nucleolus, (8) PcG body, and (9) PML body. The overall success rates obtained by both the re-substitution test and the jackknife cross-validation test are significantly higher than those of existing classifiers on the same working dataset. It is anticipated that the approach may also become a useful high-throughput vehicle for bridging the huge gap in the post-genomic era between the number of gene sequences in databases and the number of gene products that have been functionally characterized. The OET-KNN classifier will be available at www.pami.sjtu.edu.cn/people/hbshen.
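To make the idea of a pseudo amino acid composition concrete, the following is a minimal Python sketch, not the authors' implementation. It follows the general form of the cited Chou (2001) definition: 20 composition terms plus lam sequence-order correlation terms, all divided by a common denominator. Several things here are assumptions made for illustration: the function and parameter names (pseudo_aac, lam, w), the default weight of 0.05, and above all the use of a single physicochemical property (the Kyte-Doolittle hydropathy scale), where the cited definition combines several properties. The sketch also assumes the sequence contains only the 20 standard one-letter codes.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values (an illustrative single property;
# the cited PseAAC definition combines several properties).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(table):
    """Rescale property values to zero mean and unit standard deviation."""
    values = list(table.values())
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {aa: (v - mean) / sd for aa, v in table.items()}


def pseudo_aac(seq, lam=2, w=0.05):
    """Return a (20 + lam)-dimensional pseudo amino acid composition."""
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    prop = standardize(HYDROPATHY)
    # Conventional composition: occurrence frequency of each amino acid.
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    # Sequence-order terms: mean squared property difference between
    # residues that are k positions apart, for k = 1..lam.
    thetas = []
    for k in range(1, lam + 1):
        pairs = len(seq) - k
        total = sum((prop[seq[i + k]] - prop[seq[i]]) ** 2 for i in range(pairs))
        thetas.append(total / pairs)
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


# Same composition, different order: only PseAAC tells them apart.
print(pseudo_aac("AACC", lam=1)[20], pseudo_aac("ACAC", lam=1)[20])
```

The sequences AACC and ACAC have identical conventional amino acid compositions, yet their last components differ, which is the sequence-order information the abstract refers to.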

Materials and methods

The proteins used for this study were collected from the NPD (Nuclear Protein Database) [12] at http://npd.hgu.mrc.ac.uk/. The sequences of proteins in NPD are derived from the SWISS-PROT and TREMBL Data Banks [13]. To construct a high-quality working dataset, all the data were screened strictly according to the following procedures. (1) Included were only those sequences with a clear locational description in the nucleus. (2) For protein sequences having the same name but from different

Results and discussion

The predictions were examined by the re-substitution test and the jackknife test on the 370 proteins classified into 9 subnuclear locations (Table 1). The re-substitution test examines the self-consistency of a prediction method [21], while the jackknife test is deemed the most objective and rigorous procedure for cross-validation [21] and has been used by an increasing number of investigators [8], [9], [11], [22], [23], [24], [25], [26], [27] to examine the power of various prediction methods.
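The difference between the two tests can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's code: a plain K-nearest-neighbor vote stands in for OET-KNN, the function names are hypothetical, and the five two-dimensional points with their labels are invented toy data. In the re-substitution test every sample is predicted with itself still in the training set; in the jackknife test each sample is removed in turn and predicted from the remaining ones.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Plain K-nearest-neighbor majority vote (a stand-in, not OET-KNN)."""
    ranked = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def resubstitution_rate(xs, ys, k=3):
    """Predict each sample while it is still part of the training set."""
    hits = sum(knn_predict(xs, ys, x, k) == y for x, y in zip(xs, ys))
    return hits / len(xs)


def jackknife_rate(xs, ys, k=3):
    """Leave each sample out in turn and predict it from all the others."""
    hits = 0
    for i, (x, y) in enumerate(zip(xs, ys)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        hits += knn_predict(rest_x, rest_y, x, k) == y
    return hits / len(xs)


# Invented toy data: the last point sits among samples of the other class.
xs = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.9, 0.9)]
ys = ["nucleolus", "nucleolus", "chromatin", "chromatin", "nucleolus"]

print(resubstitution_rate(xs, ys, k=1))  # 1.0
print(jackknife_rate(xs, ys, k=1))       # 0.8
```

With k=1 the re-substitution rate is trivially perfect because every sample is its own nearest neighbor, whereas the jackknife test exposes the misplaced fifth point. This is why re-substitution is read only as a self-consistency check.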

For the

Conclusion

The OET-KNN algorithm is a very powerful classifier. Using pseudo amino acid composition to represent protein samples can incorporate a considerable amount of sequence-order effects that are totally omitted by the conventional amino acid composition. That is why the current approach, which has combined the two advantages, can significantly outperform the other approaches, such as ProtLock and SVM. It is anticipated that with the improvement of the training dataset as more proteins with known

References (30)

  • K.C. Chou, Prediction of protein cellular attributes using pseudo amino acid composition, PROTEINS: Structure,...
  • Y.X. Pan et al., Application of pseudo amino acid composition for predicting protein subcellular location: stochastic signal processing approach, J. Protein Chem. (2003)
  • G.P. Zhou et al., Subcellular location prediction of apoptosis proteins, PROTEINS: Struct. Funct. Genet. (2003)
  • K.C. Chou, Y.D. Cai, Prediction and classification of protein subcellular location: sequence-order effect and pseudo...
  • X. Xiao et al., Using complexity measure factor to predict protein subcellular location, Amino Acids (2005)
Abbreviations: KNN, K-nearest neighbors; ET-KNN, evidence theoretic KNN; OET-KNN, optimized evidence theoretic KNN; NPD, Nuclear Protein Database.
