Prediction of subcellular location of apoptosis proteins combining tri-gram encoding based on PSSM and recursive feature elimination
Introduction
Apoptosis, also known as programmed cell death, plays a critical role in many important biological processes such as morphogenesis, tissue homeostasis, the elimination of damaged or virally infected cells, and the elimination of self-reactive clones from the immune system (Steller, 1995). Since the function of apoptosis proteins has been shown to correlate with their subcellular location, information about the subcellular location of apoptosis proteins can be very useful for understanding their role in programmed cell death and for revealing the mechanism of apoptosis (Zhou and Doctor, 2003). Experimental determination of subcellular location is costly and time-consuming, and thus cannot fully meet researchers’ demands. With the ever-increasing volume of sequence data, developing reliable and accurate computational methods to predict the subcellular location of apoptosis proteins from their primary sequences remains a major challenge.
Compared with the extensive work on protein subcellular location prediction (Chou and Shen, 2007), studies on predicting the subcellular location of apoptosis proteins are limited, possibly because few experimentally validated apoptosis proteins are available in the databases. Generally, these methods involve two major tasks: (1) the design of the protein encoding scheme or feature extraction; (2) the selection of the classifier or predictor. For the former task, several sequence features have been applied to represent protein sequences, including amino acid composition (Zhou et al., 2008), pseudo-amino acid composition (Chen and Li, 2007a, Chen and Li, 2007b, Ding and Zhang, 2008, Jiang et al., 2008, Liao et al., 2011, Lin et al., 2009, Yu et al., 2012), grouped weight encoding (Zhang et al., 2006), wavelet coefficients (Qiu et al., 2010) and distance frequency (Zhang et al., 2009). For the latter task, several machine learning algorithms have been used to perform the prediction, such as covariant discriminant (Zhou and Doctor, 2003), fuzzy k-nearest neighbor (FKNN) (Ding and Zhang, 2008, Jiang et al., 2008), support vector machine (SVM) (Huang and Shi, 2005, Liu et al., 2010, Qiu et al., 2010, Zhang et al., 2009) and ensemble classifiers (Gu et al., 2010, Saravanan and Lakshmi, 2013). Among these classifiers, SVM has shown quite promising results (Liu et al., 2010).
Various SVM-based methods for identifying protein attributes differ mainly in the protein sequence encoding schemes used. In this study, we focus on developing a feature extraction technique to predict the subcellular location of apoptosis proteins. We first introduce a novel representation based on the position-specific scoring matrix (PSSM), which incorporates the evolutionary information represented in the PSI-BLAST profile and sequence-order information by computing tri-grams. The 20 standard amino acids yield 20 × 20 × 20 = 8000 possible amino acid triplets (tri-grams), giving an 8000-dimensional feature vector for a given protein sequence. Then, recursive feature elimination by linear support vector machine (SVM-RFE) is applied for feature selection, and the reduced vectors are input to an SVM classifier to perform the prediction. The proposed representation is shown to substantially improve the prediction of the subcellular location of apoptosis proteins. The source code for implementing the algorithm and the datasets used in this study are freely available to the academic community at http://xxxy.shou.edu.cn/bioinform/SubLoc-Trigram/index.html.
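As a rough illustration of how a PSSM profile can be turned into an 8000-dimensional tri-gram vector, the sketch below assumes a commonly used formulation in which the entry for the triplet (m, n, r) is the sum, over every window of three consecutive sequence positions, of the product of the three corresponding PSSM scores. The exact formula and any normalization of the PSSM are not given in this excerpt, so the formula, the function name `trigram_pssm_features` and the L × 20 input layout are assumptions made for illustration only.

```python
import numpy as np


def trigram_pssm_features(pssm):
    """Return an 8000-dimensional tri-gram vector for an (L x 20) PSSM.

    Assumed formulation (not confirmed by the excerpt):
        T[m, n, r] = sum_i P[i, m] * P[i + 1, n] * P[i + 2, r]
    where P is the PSSM and i runs over the first L - 2 positions.
    """
    pssm = np.asarray(pssm, dtype=float)
    if pssm.ndim != 2 or pssm.shape[1] != 20:
        raise ValueError("expected an L x 20 matrix")
    if pssm.shape[0] < 3:
        raise ValueError("need at least 3 sequence positions")

    # Three views of the PSSM, shifted by 0, 1 and 2 positions.
    first, second, third = pssm[:-2], pssm[1:-1], pssm[2:]

    # Sum the outer product of each window of three consecutive rows.
    tensor = np.einsum("im,in,ir->mnr", first, second, third)

    # Flatten the 20 x 20 x 20 tensor; triplet (m, n, r) lands at
    # index 400 * m + 20 * n + r.
    return tensor.reshape(-1)
```

For a toy profile of length 3 whose rows are one-hot at columns 0, 1 and 2, only the entry for the triplet (0, 1, 2) is non-zero. Real PSSMs produced by PSI-BLAST contain negative log-odds scores, so some rescaling of the matrix before this step is plausible, but that is not specified here.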
Datasets
Two datasets, ZD98 (Zhou and Doctor, 2003) and ZW225 (Zhang et al., 2006), are used to demonstrate the performance of the proposed method. The ZD98 dataset contains 43 cytoplasmic proteins, 30 plasma membrane-bound proteins, 13 mitochondrial proteins and 12 other proteins. The ZW225 dataset consists of 41 nuclear proteins, 70 cytoplasmic proteins, 25 mitochondrial proteins and 89 membrane proteins. Although both datasets are small, they have been widely used in previous studies. To further
Effect of top K features
By computing tri-gram features, we first obtain an 8000-dimensional feature vector for each protein. We then apply SVM-RFE to rank these features according to their importance. To determine the best accuracy and the corresponding number of features, we calculate the overall accuracies for the top K features using jackknife cross-validation, where K=10, 20, 30, …, 300. The results on the ZW225 and CL317 datasets are shown in Fig. 1. To keep the figure legible, the results on the ZD98 dataset are
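The ranking and top-K sweep described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the linear kernel and `C=1.0` for the final classifier, and the synthetic data are all hypothetical, and jackknife cross-validation is taken to mean leave-one-out.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC


def rank_features_svm_rfe(X, y, step=1):
    """Rank features by linear-SVM RFE; returns column indices, best first."""
    selector = RFE(SVC(kernel="linear", C=1.0), n_features_to_select=1, step=step)
    selector.fit(X, y)
    # ranking_ assigns 1 to the last surviving feature, larger values to
    # features that were eliminated earlier.
    return np.argsort(selector.ranking_, kind="stable")


def jackknife_accuracy_for_top_k(X, y, order, ks):
    """Leave-one-out accuracy of an SVM trained on the top-k ranked features."""
    results = {}
    for k in ks:
        # Kernel and C are assumptions; the excerpt does not state them.
        clf = SVC(kernel="linear", C=1.0)
        scores = cross_val_score(clf, X[:, order[:k]], y, cv=LeaveOneOut())
        results[k] = float(scores.mean())
    return results


# Synthetic stand-in for the real feature matrix (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 30))
y_demo = np.repeat([0, 1], 20)
X_demo[:, :3] += 2.0 * y_demo[:, None]  # make the first three columns informative

order = rank_features_svm_rfe(X_demo, y_demo)
acc = jackknife_accuracy_for_top_k(X_demo, y_demo, order, ks=[5, 10, 20])
best_k = max(acc, key=acc.get)
```

On the real 8000-dimensional vectors, the sweep in the text corresponds to `ks=range(10, 301, 10)`, and a larger `step` would keep the elimination tractable. Note that this sketch ranks features once on the full dataset before the leave-one-out loop, which can bias the accuracy estimate upward; whether the original study ranks inside each fold is not stated in this excerpt.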
Conclusions
In this study, we focus on the design of a high-quality sequence encoding scheme for predicting the subcellular location of apoptosis proteins. By integrating evolutionary information and sequence-order information, a PSSM-based tri-gram encoding scheme is first introduced to transform the PSSM profiles of proteins into 8000-dimensional feature vectors. The SVM-RFE algorithm is then adopted to reduce the number of features and the computational complexity. Finally, the optimal 70 features are selected to perform
Acknowledgments
This work was partially supported by the Innovation Program of Shanghai Municipal Education Commission (No. 13YZ098), the Foundation for University Youth Teachers of Shanghai (No. ZZhy12028), the National Natural Science Foundation of China (No. 41376135), the Doctoral Fund of Ministry of Education of China (No. 20133104110006) and the Doctoral Fund of Shanghai Ocean University.
References (24)
- et al. Prediction of the subcellular location of apoptosis proteins. J. Theor. Biol. (2007)
- et al. Prediction of apoptosis protein subcellular location using improved hybrid approach and pseudo-amino acid composition. J. Theor. Biol. (2007)
- et al. Recent progress in protein subcellular location prediction. Anal. Biochem. (2007)
- et al. Using Chou's pseudo amino acid composition to predict subcellular localization of apoptosis proteins: an approach with immune genetic algorithm-based ensemble classifier. Pattern Recogn. Lett. (2008)
- et al. A novel representation for apoptosis protein subcellular localization prediction using support vector machine. J. Theor. Biol. (2009)
- et al. A novel method for apoptosis protein subcellular localization prediction combining encoding based on grouped weight and support vector machine. FEBS Lett. (2006)
- et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997)
- et al. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (2011)
- et al. Prediction of protein structural class using novel evolutionary collocation-based sequence representation. J. Comput. Chem. (2008)
- et al. Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol. (1995)
- Prediction of subcellular location apoptosis proteins with ensemble classifier and feature selection. Amino Acids
- Gene selection for cancer classification using support vector machines. Mach. Learn.