Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition

doi:10.1016/j.bbrc.2005.09.117

Biochemical and Biophysical Research Communications

Volume 337, Issue 3, 25 November 2005, Pages 752-756

Abstract

The nucleus is the brain of eukaryotic cells, guiding the life processes of the cell by issuing key instructions. For an in-depth understanding of the biochemical processes of the nucleus, knowledge of the localization of nuclear proteins is very important. With the avalanche of protein sequences generated in the post-genomic era, it is highly desirable to develop an automated method for rapidly annotating the subnuclear locations of the many newly found nuclear protein sequences, so that they can be used in a timely manner for basic research and drug discovery. In view of this, a novel approach is developed for predicting protein subnuclear location. It introduces a powerful classifier, the optimized evidence-theoretic K-nearest neighbor (OET-KNN) classifier, and represents protein samples with the pseudo amino acid composition [K.C. Chou, PROTEINS: Structure, Function, and Genetics, 43 (2001) 246], which can incorporate a considerable amount of sequence-order information. As a demonstration, identifications were performed for 370 nuclear proteins among the following 9 subnuclear locations: (1) Cajal body, (2) chromatin, (3) heterochromatin, (4) nuclear diffuse, (5) nuclear pore, (6) nuclear speckle, (7) nucleolus, (8) PcG body, and (9) PML body. The overall success rates obtained by both the re-substitution test and the jackknife cross-validation test are significantly higher than those of existing classifiers on the same working dataset. It is anticipated that this approach may also become a useful high-throughput tool for bridging the huge gap in the post-genomic era between the number of gene sequences in databases and the number of gene products that have been functionally characterized. The OET-KNN classifier will be available at www.pami.sjtu.edu.cn/people/hbshen.
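To make the classifier idea concrete, here is a minimal sketch of an evidence-theoretic K-nearest neighbor rule in its commonly described Dempster-Shafer form. It is not the authors' implementation. The function name `et_knn_predict`, the fixed values of `alpha` and `gamma`, and the use of squared Euclidean distance are all assumptions for illustration; the "optimized" variant named in the abstract would tune such parameters on the training data, which this sketch does not attempt.

```python
import math

def et_knn_predict(query, samples, labels, k=3, alpha=0.95, gamma=1.0):
    """Classify `query` by pooling evidence from its k nearest neighbors.

    Each neighbor of class q contributes a mass alpha * exp(-gamma * d^2)
    to {q} and the remainder to the whole set of classes (ignorance).
    The masses are pooled with Dempster's rule of combination.
    alpha and gamma are fixed here (hypothetical values, not tuned).
    """
    # Squared Euclidean distance to every training sample; keep the k nearest.
    neighbors = sorted(
        (sum((a - b) ** 2 for a, b in zip(query, x)), y)
        for x, y in zip(samples, labels)
    )[:k]

    classes = sorted(set(labels))
    mass = {c: 0.0 for c in classes}  # mass committed to each single class
    mass_omega = 1.0                  # mass left on "any class" (ignorance)

    for d2, q in neighbors:
        s = alpha * math.exp(-gamma * d2)  # support this neighbor gives class q
        # Dempster's rule for masses focused on singletons and the full set.
        new = {c: mass[c] * (1.0 - s) for c in classes}
        new[q] += (mass[q] + mass_omega) * s
        new_omega = mass_omega * (1.0 - s)
        total = sum(new.values()) + new_omega  # equals 1 minus the conflict
        mass = {c: v / total for c, v in new.items()}
        mass_omega = new_omega / total

    best = max(classes, key=lambda c: mass[c])
    return best, mass, mass_omega

# Toy two-class data, invented for illustration only.
toy_samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
toy_labels = ["a", "a", "a", "b", "b", "b"]
```

A query far from every training sample keeps almost all of its mass on the full set of classes, which is one way such a rule can signal that a prediction is poorly supported.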

Section snippets

Materials and methods

The proteins used for this study were collected from the NPD (Nuclear Protein Database) [12] at http://npd.hgu.mrc.ac.uk/. The protein sequences in NPD are derived from the SWISS-PROT and TREMBL Data Banks [13]. To construct a high-quality working dataset, all data were screened strictly according to the following procedure. (1) Only those sequences with a clear locational description in the nucleus were included. (2) For protein sequences having the same name but from different

Results and discussion

The predictions were examined by the re-substitution test and the jackknife test on the 370 proteins classified into 9 subnuclear locations (Table 1). The re-substitution test is used to examine the self-consistency of a prediction method [21], while the jackknife test is deemed the most objective and rigorous procedure for cross-validation [21] and has been used by a growing number of investigators [8], [9], [11], [22], [23], [24], [25], [26], [27] to examine the power of various prediction methods.
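The difference between the two tests can be illustrated with a short sketch. In the re-substitution test every sample is predicted by a rule trained on the full dataset, including that sample; in the jackknife test each sample is left out in turn and predicted from the remaining ones. The plain 1-nearest-neighbor rule and the toy data below are assumptions chosen only to keep the example small; they are not the classifier or the dataset of the paper.

```python
def nearest_label(query, samples, labels):
    """Plain 1-nearest-neighbor rule (a stand-in classifier for illustration)."""
    best = min(
        range(len(samples)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(query, samples[i])),
    )
    return labels[best]

def resubstitution_rate(samples, labels, classify):
    """Success rate when each sample is predicted with itself in the training set."""
    hits = sum(classify(x, samples, labels) == y for x, y in zip(samples, labels))
    return hits / len(samples)

def jackknife_rate(samples, labels, classify):
    """Success rate when each sample is predicted from all the other samples."""
    hits = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        rest_x = samples[:i] + samples[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        hits += classify(x, rest_x, rest_y) == y
    return hits / len(samples)

# Toy data, invented for illustration: the last point of class "a"
# lies closer to the "b" cluster than to its own class.
demo_samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (3, 3)]
demo_labels = ["a", "a", "a", "b", "b", "b", "a"]

print(resubstitution_rate(demo_samples, demo_labels, nearest_label))
print(jackknife_rate(demo_samples, demo_labels, nearest_label))
```

On this toy data the re-substitution rate is perfect because each point is its own nearest neighbor, while the jackknife rate drops once the borderline point has to be predicted from the others. This is why the former checks only self-consistency and the latter is the more demanding test.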

For the

Conclusion

The OET-KNN algorithm is a very powerful classifier. Representing protein samples by pseudo amino acid composition incorporates a considerable amount of sequence-order information that is entirely missed by the conventional amino acid composition. That is why the current approach, which combines the two advantages, can significantly outperform other approaches such as ProtLock and SVM. It is anticipated that with the improvement of the training dataset as more proteins with known
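The contrast between conventional and pseudo amino acid composition can be sketched as follows. The first 20 components are the usual residue frequencies; the extra components are sequence-order correlation factors computed between residues k positions apart. This is a simplified sketch under stated assumptions: it uses a single property (the Kyte-Doolittle hydropathy scale) where the original formulation combines several physicochemical properties, and the function name `pseaac`, the weight `w=0.05`, and the default `lam=3` are illustrative choices, not values taken from the paper.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values (assumed here as the only property).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def standardize(scale):
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    values = [scale[a] for a in AMINO_ACIDS]
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {a: (scale[a] - mean) / sd for a in AMINO_ACIDS}

H = standardize(KYTE_DOOLITTLE)

def pseaac(sequence, lam=3, w=0.05):
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector."""
    seq = [r for r in sequence.upper() if r in H]  # drop non-standard symbols
    n = len(seq)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    counts = Counter(seq)
    freqs = [counts[a] / n for a in AMINO_ACIDS]
    # k-th tier correlation factor: mean squared property difference
    # between residues that are k positions apart along the chain.
    thetas = [
        sum((H[seq[i + k]] - H[seq[i]]) ** 2 for i in range(n - k)) / (n - k)
        for k in range(1, lam + 1)
    ]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]

# Two toy sequences with identical composition but different residue order.
blocky = pseaac("AAKKAAKK", lam=2)
alternating = pseaac("AKAKAKAK", lam=2)
```

The two toy sequences have exactly the same amino acid composition, so a conventional 20-dimensional representation cannot tell them apart, yet their correlation components differ. That difference is the sequence-order information the extra components are meant to carry.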

References (30)

K. Nakai et al.
A knowledge base for predicting protein localization sites in eukaryotic cells
Genomics
(1992)
K. Nakai
Protein sorting signals and prediction of subcellular localization
Adv. Protein Chem.
(2000)
J. Cedano et al.
Relation between amino acid composition and cellular location of proteins
J. Mol. Biol.
(1997)
K.C. Chou et al.
Using functional domain composition and support vector machines for prediction of protein subcellular location
J. Biol. Chem.
(2002)
J.J. Chou et al.
A joint prediction of the folding types of 1490 human proteins from their genetic codons
J. Theor. Biol.
(1993)
K.C. Chou et al.
Predicting protein folding types by distance functions that make allowances for amino acid interactions
J. Biol. Chem.
(1994)
Z. Yuan
Prediction of protein subcellular locations using Markov chain models
FEBS Lett.
(1999)
M. Wang et al.
SLLE for predicting membrane protein types
J. Theor. Biol.
(2005)
R.F. Murphy et al.
Towards a systematics for protein subcellular location: quantitative description of protein localization patterns and automated analysis of fluorescence microscope images
Proc. Int. Conf. Intell. Syst. Mol. Biol.
(2000)
K.C. Chou et al.
Protein subcellular location prediction
Protein Eng.
(1999)
K.C. Chou, Prediction of protein cellular attributes using pseudo amino acid composition, PROTEINS: Structure,...
Y.X. Pan et al.
Application of pseudo amino acid composition for predicting protein subcellular location: stochastic signal processing approach
J. Protein Chem.
(2003)
G.P. Zhou et al.
Subcellular location prediction of apoptosis proteins
PROTEINS: Struct. Funct. Genet.
(2003)
K.C. Chou, Y.D. Cai, Prediction and classification of protein subcellular location: sequence-order effect and pseudo...
X. Xiao et al.
Using complexity measure factor to predict protein subcellular location
Amino Acids
(2005)

^☆: Abbreviations: KNN, K-nearest neighbors; ET-KNN, evidence theoretic KNN; OET-KNN, optimized evidence theoretic KNN; NPD, Nuclear Protein Database.
