Prediction of Apoptosis Protein Subcellular Localization with Multilayer Sparse Coding and Oversampling Approach

The prediction of apoptosis protein subcellular localization plays an important role in understanding the progress in cell proliferation and death. Recently computational approaches to this issue have become very popular, since the traditional biological experiments are so costly and time-consuming that they cannot catch up with the growth rate of sequence data anymore. In order to improve the prediction accuracy of apoptosis protein subcellular localization, we proposed a sparse coding method combined with traditional feature extraction algorithm to complete the sparse representation of apoptosis protein sequences, using multilayer pooling based on different sizes of dictionaries to integrate the processed features, as well as oversampling approach to decrease the influences caused by unbalanced data sets. Then the extracted features were input to a support vector machine to predict the subcellular localization of the apoptosis protein. The experiment results obtained by Jackknife test on two benchmark data sets indicate that our method can significantly improve the accuracy of the apoptosis protein subcellular localization prediction.


Introduction
As a basic constituent of organisms, proteins play a critical role in life activities such as metabolism, breeding, growth, and development, especially for the apoptosis protein, which are crucial in the proteomics. Since the functions of an apoptosis protein are closely related to its subcellular location and different kinds of apoptosis proteins can only function in specific subcellular location, it is important to predict the subcellular location of certain apoptosis protein by existing methods, which could not only help us to understand the interactions and properties of apoptosis proteins but also realize the biological pathway involved [1][2][3]. With the application of high-throughput sequencing techniques and the explosion of sequence data volumes, developing an accurate and reliable computational method to predict apoptosis protein subcellular location has been a great challenge for bioinformaticians, accordingly promoting the development of machine learning in this field [4][5][6][7][8].
By the analysis of research status, the improved directions of using machine learning to predict apoptosis protein subcellular location in the past decade can be roughly categorized into two classes: sequence feature extraction and prediction model [5][6][7][8][9][10]. Currently the widely used methods for feature extraction are amino acid composition (AAC) [11,12], pseudo amino acid composition (PseAAC) [13,14], gene ontology (GO) [15,16], position specific scoring matrix (PSSM) [17,18], feature fusion [19,20], and so on. For example, Zhou et al. used the covariant discriminant function based on Mahalanobis distance and Chou's invariance theorem; combining with traditional AAC feature to predict apoptosis protein subcellular location, the prediction result by Jackknife test on data set ZD98 achieved about 72.5% [21]; Wan et al. proposed GOASVM algorithm based on the information of GO term frequencies and distant homologs to represent a protein in general form of PseAAC and got a high accuracy [22]; Chen et al. used the increment of diversity to fuse N-terminal, C-terminal, and hydrophobic features of apoptosis protein sequences, and the accuracies on ZD98 and CH317 were 90.8% and 82.7%, respectively [23]; Zhao et al. combined the bag of words model with PseAAC method, using K-Means to construct the dictionary of sequence features, and obtained a great predictive effect [24]. At the same time, there are also many efforts for the development of prediction model. For example, Wan et al. proposed an adaptive-decision support vector machine classifier through the annotation information 2 BioMed Research International of GO database and realized the prediction of membrane proteins as well as their multifunctional types [25]; Ali et al. extracted the PseAAC features of protein sequences, combining with location voting, k-nearest neighbor and probabilistic neural network to predict the subcellular localization of membrane proteins [26]. Besides, there are also some other prediction models used in this filed such as logistic regression, bayesian classifiers, and long short-term memory [27][28][29].
In the last decade or so, a recent review [30] pointed that a number of web-servers were developed for predicting the subcellular localization of proteins with both single site and multiple sites [31][32][33][34][35][36]. In general, proteins can simultaneously exist in multiple sites. In this study, given that the number of multilabel proteins in the existing apoptosis protein database is not large enough to construct a benchmark data set meaningfully in statistics and for the case of multiple locations, the sequence information is more complex and various than single locations, using oversampling approach to copy sequence feature may generate the inaccurate results, so here we did not consider the situation of multilabel proteins.
To summarize the previous research results, it is not difficult to find that the prediction accuracy is relatively low if merely using simple method such as AAC or PseAAC to extract sequence features for classification; as for the other feature extraction methods, like PSSM or feature fusion, though the prediction effect is better, the extraction process is too complicated and time-consuming for practical application. Given that many former researches have proved that support vector machine is one of the best classifiers for the prediction of protein subcellular localization [5,9,10,14,17,22], in this study, we focus on obtaining a higher prediction accuracy on the premises of simple feature extraction method and support vector machine to predict the subcellular localization of the apoptosis protein, therefore, finding an efficient approach to optimize the traditional sequence-based feature is the key problem to be solved in this paper.
In the study, we proposed a sparse coding method combined with traditional sequence feature extraction algorithm to extract low-level features of the apoptosis protein sequence, using multilayer pooling based on different sizes of dictionaries to integrate the local and holistic features of the sparse representation. Then the support vector machine was used to complete the final prediction. Given that our adopted benchmark data sets are unbalanced which may influence the classification effects of support vector machine [37], we used an oversampling approach to balance the data sets in the study. Compared with other experimental results with the same support vector machine classifier, the experimental results show that the proposed method can not only simplify the feature extraction process and reduce the time and space complexity of the classifier but also reflect the sequence features more comprehensively and improve the classification performance. The detailed descriptions are shown in the following sections.

Materials and Methods
. . Datasets. Two widely used benchmark data sets are adopted in this study: ZD98 and CH317, respectively. The data set ZD98 was constructed by Zhou and Doctor [21]. There are 98 apoptosis protein sequences divided into four kinds of subcellular locations, which are cytoplasmic proteins (Cy), mitochondrial proteins (Mi), membrane proteins (Me), and other proteins (Other). The data set CH317 was constructed by Chen and Li [23] and contains a total of 317 apoptosis protein sequences, in 6 classes of subcellular locations that are secreted proteins (Se), nuclear proteins (Nu), cytoplasmic proteins (Cy), endoplasmic reticulum proteins (En), membrane proteins (Me), and mitochondrial proteins (Mi). Considering that the above data sets are old, we update ZD98 and CH317 data sets with reference to Wang et al. [38] and remove some of the duplicates and error sequences. The specific method is not repeated here. After processing, there were 96 protein sequences remaining in ZD98 data set and 314 protein sequences remaining in CH317 data set. All protein sequences in the above two data sets are from the latest version of the UniProt database (Release 2018 12), and the number of protein sequences in each class of 2 data sets is shown in Table 1.
. . Feature Extraction. In order to set up a more accurate mapping relationship between each protein sequence and its corresponding feature vector, multilayer sparse coding was introduced in this study to find the most essential feature of original protein sequence based on simple feature extraction method. The algorithm mainly includes the following steps: local feature extraction, sparse coding, and pooling. And the process of sparse coding is divided into 2 sections: dictionary learning and sparse representation. Firstly, the protein sequence is segmented into some fragments, and the traditional protein feature extraction algorithm will be used to extract the features of these fragments, which could be applied for the step of dictionary learning. Then these local features are trained to construct a dictionary and the feature representation of original sequence would be sparsely reconstructed by it. The mean pooling is used to reduce the dimensions of the feature matrix, and finally the pooled vectors based on different dictionary sizes would be integrated as the ultimate features of protein sequences. The flow chart of extraction progress is shown in Figure 1.
. . . Local Feature Extraction. Before the step of sparse coding, it is necessary to extract the local features of protein sequence to constitute a training sample set for dictionary learning. Since every protein sequence has the different length and the critical features may be distributed in different positions of the sequence, in this paper, we adopted sliding window segmentation method inspired by Noor to cut all the protein sequences into pieces [39], generating a number of sequence fragments afterwards. The size of sliding window represents the segment length of each protein sequence, and the reference formula is where 1 , 2 , . . . , represent the length of each protein sequence in the whole data set, is the shortest sequence of it, and is the size of sliding window, which indicates that the value of segment length is between /2 and , and the exact value will be selected by the experimental experience.
After the step of segmentation, the existing sequence feature extraction method is used to statistically analyze the component information of sequence fragments and to transform the character sequences into numerical vectors as the local features of the protein. Effective feature extraction method can remarkably increase the final prediction accuracy. Nakashima and Nishikawa [47] firstly associated the amino acid composition (AAC) with the prediction of protein subcellular location in 1994. The AAC coding method was proposed to count the occurrence frequency of each amino acid in the protein sequence, described as follows: where 1 , 2 , 3 , . . . , 20 represent the number of each amino acid in the protein sequence, respectively and the specific explanation is represents the length of each protein sequence, that is, the total number of all the amino acid residues contained. Firstly, 20 amino acids are numbered from 1 to 20, and ( = 1, 2, 3, . . . , 20) describes the frequency of corresponding number appeared in the sequence.
represents each amino acid residue in original sequence, and ( ) represents the amino acid residue which corresponds to the number .
By using AAC to calculate the fragment features of protein sequence P, we can obtain a feature matrix for each original protein sequence constituted by all the AAC features of corresponding fragments. The matrix is shown in where represents the number of fragments cut by a protein sequence, is the feature dimension processed by AAC algorithm, and V represents the probability of occurrence of different amino acid residues. At this time, is 20. Each row of the matrix represents the feature vectors of different sequence fragments in a protein sequence. Generally we choose some of the fragment features as the local features to construct the dictionary, in this paper; since the number of fragments obtained is not very large, in order to get a better feature representation in spar coding, we chose the local features of all the sequence fragments to form a training sample set = [ 1 , 2 , 3 , . . . , ] for dictionary learning, where ∈ ( = 1, 2, 3, . . . , ), represents the feature vector of different protein sequences, that is, the vector in each row in , and is the number of fragments belonged to all of the protein sequences in the data set.
. . . Sparse Coding. Sparse coding is a branch of deep neural networks, and it contains 2 main steps: dictionary learning and sparse representation, respectively [48]. It can extract the detailed features of original data set and decompose the input sample set into a linear combination of multiple primitives. The coefficients of the primitives are the features of input sample. The description can be formulated as where is the matrix of training sample composed by fragment features; = [ 1 , 2 , 3 , . . . , ] ∈ * is the primitive matrix named the dictionary, represents the feature elements of dictionary, is the size of dictionary, is feature dimension processed by AAC algorithm,; = [ 1 , 2 , 3 , . . . , ] ∈ * is the sparse representation of original sample, and represents the sparse coefficient of the i-th feature block in the sparse feature space, that is, the projection of in sparse feature space. is the number of fragments belonged to all of the protein sequences in the data set. The solution of dictionary can be expressed as where ‖ • ‖ 2 represents 2 norm of a vector and ‖ • ‖ 0 is 0 norm of a vector. The constraint in formula above means that the number of nonzero elements in needs to be less than or equal to 0 , which is preset and related to the sparse rate. Equation (6) is essentially a nonconvex optimization problem. There are mainly two common solutions for it: the first is to transform it into a convex optimization problem to relax the constraint of equation and then transforms it into the following form: where is the balance factor and ‖ • ‖ 1 represents 1 norm of a vector. Equation (7) can usually be solved by regression algorithm, such as LASSO [49]. The second is to solve it by using the heuristic greedy algorithm [50]. The algorithms for the second solution are MOD and K-SVD [51]. In this study, in view of the efficiency and operability of the algorithm, we choose K-SVD as our solution to learn the dictionary; that is, the second solution.K-SVD is an expansion of Kmeans algorithm proposed by Aharon and Elad [52]. It adopts the method of iterative alternating learning and uses the singular value decomposition to perform times iterations to optimize the primitives of dictionary, which can better fit the original data. K-SVD is mainly divided into the following steps: (1) Initialize the dictionary , and set the terminal condition of iteration; (2) Fix , solve the sparse representation ; (3) Fix , solve the dictionary ; (4) Perform steps (2) and (3) alternately until the end of the iteration.
After obtaining the dictionary, the orthogonal matching pursuit (OMP) algorithm is used to complete the sparse representation of the fragment features of the original protein sequence [53]. The basic theory of OMP is to select one of the most matching primitives from the dictionary to perform a sparse approximation with the primitives of original samples and to obtain the residual between them. Then, it continues to select the next proper primitive which is best matched with this signal residual and iterates in this way over and over until the residual and sparse rate meets the fixed terminal conditions. Samples can be approximately presented by a linear combination of these derived primitives. All primitives selected in each process must be orthogonalized first, which would make the convergence speed faster [54]. Constituting the sparse features of all the encoded fragments, we can obtain an * sparse matrix to represent the feature of each protein sequence, where is the number of sequence segments in each sequence and is the size of dictionary, that is, the sparse representation of a protein sequence.
. . . Multilayer Pooling. The dimension of the feature matrix obtained by sparse coding is very high, if we expand it directly, the huge data volume will cause redundant space and time complexities of classification, and it is prone to overfitting. Therefore, it is necessary to reduce the dimensions of the feature matrix. The method of pooling can map the collection of feature vectors into a single vector. There are two different common pooling methods that are the max pooling and mean pooling, respectively. The aggregation statistics of features in different positions can extract the effective information and reduce the calculated amount of numerical matrix [55]. Max pooling takes the maximum value of the feature points in the neighborhood and retains the edge information of the feature matrix more, while mean pooling takes the average value of the feature points in the neighborhood and more to extract the background information [56]. Given that the characters of sequence data are different from images, we chose the mean pooling as the final dimension-reduced method. The formula is shown as follows: where = 1, 2, 3, . . . , being obtained by averaging the elements in the i-th row of the matrix . After being processed by mean pooling, each protein sequence is represented as a dimensional feature vector, is the size of dictionary.
In order to obtain a more overall feature representation of original protein sequence, multilayer pooling based on different sizes of dictionary is performed, and several pooling results will be integrated to help extract the local and holistic features severally. The specific description is as follows: in the process of sparse coding, the values of dictionary sizes are set to 1 , 2 , and 3 respectively; thus 3 different levels of dictionary could be obtained by K-SVD algorithm. Then the OMP algorithm is used to complete the sparse representation of fragment features based on different dictionary sizes, and the sparse features are combined to obtain the feature matrix of original sequence. Finally the sparse matrix will be mean pooled to extract different levels of feature vectors. The vectors in each pooled block are concatenated together to obtain a 1 + 2 + 3 dimensional vector as the final feature representation. In this paper, the values of were set to 30, 50, and 70, respectively, generating a 150 dimensional vector to be selected by principal component analysis (PCA) and sent to the classifier for prediction. The general descriptions of spare coding and pooling can be shown in Figure 2. . . Oversampling Method. Since the data sets used in this paper are not balanced, which may cause the low accuracy of prediction, we referred to some similar papers used the oversampling to balance the data set [16,30,43]. In order to further illustrate the effect of our method, a simple oversampling method called synthetic minority oversampling technique (SMOTE) was applied in the study to decrease the imbalance of our data set. SMOTE is a classical oversampling method proposed by Chawla et al. [57]. It is widely used for its good classification effect and simple operation. The basic principle of SMOTE algorithm is to synthesize new minority samples between a few neighbouring samples and to reduce the imbalance of the data distribution. The details are as follows: (1) For each sample in the class of smaller number of data set, calculate the Euclidean distance from other samples in the minority class to obtain the nearest neighbor samples.
(2) Assuming that the sampling magnification is , for each of the few classes of samples , ( > ) samples are randomly selected from their nearest neighbor samples and these samples are recorded as 1 , 2 , 3 , . . . , .
(3) According to the following, combine each sample with samples to perform random interpolation operations to synthesize interpolated samples : where rand(0, 1) represents a random number within (0, 1) and represents the i-th nearest neighbor sample of . (4) Finally, the interpolated sample is added to the original sample set to form a new sample set.
The imbalance degree of the data set determines the value of , and the imbalanced level (IL) between majority and minority of the data set is calculated according to where ( ) represents the value obtained by rounding up IL. Through the above interpolation operation, the majority and the minority samples can be effectively balanced to improve the accuracy of classification. In this study, the minority classes of 2 data sets are balanced by SMOTE, and the processed results after are as Table 2.
. . Classifier and Performance Measures. In order to facilitate the comparison with other feature extraction algorithms, we used support vector machine (SVM) as the classification model in this study. After the feature extraction of protein sequences, the universal package of LIBSVM developed by Lin was applied to construct the SVM multiclass classifier [58]. The Jackknife test was also adopted to examine the effectiveness of classifier in our experiment. Jackknife test has the least arbitrary that can always yield a unique result for a given benchmark dataset [59]. Furthermore, in order to have a more comprehensive evaluation, sensitivity (Se), specificity (Sp), Matthew's correlation coefficient (MCC), and the overall accuracy (OA) over the entire data set are applied as the evaluation index [20,21,60]. These parameters are detailed in = + where TP, TN, FP, and FN are the number of true positives, true negatives, false positives, and false negatives, respectively; is the total number of protein sequences and is the class number.
. . Parameters Selection. There are two key parameters in this study. One is the length of sequence fragment in the local feature extraction. The shortest protein sequence length in the data set is 50, and the fragment length is selected between 25 and 50. Figure 3 shows the prediction accuracy of the data set ZD98 and CH317, respectively, when taking different slice lengths.
As shown in Figure 3, when the sequence length is between 35 and 40, the prediction accuracies on the data sets ZD98 and CH317 are the highest and tend to be stable, and the  current length is the optimal value. The optimal values for the two data sets used in this study are 35 and 40, respectively. When using PCA to select the final feature vectors, the setting of dimension D has an effect on the accuracy of the entire algorithm. The more dimensions are selected and the more features are included, but the training time of the classifier may be too long. The smaller the dimension is, the more likely it is to lose some truly meaningful features and affect the classification effect. Therefore, an optimal needs to be sought through experiments. Figure 4 shows the prediction accuracy corresponding to the different taken by the data sets ZD98 and CH317 during the feature selection of PCA.
As shown in Figure 4, when the dimension of the feature vector is low, the prediction accuracy of two data sets is relatively low. When the dimension is higher than a certain value, the prediction accuracy is also reduced. When the dimension is between 60 and 70, the prediction accuracies on the data sets ZD98 and CH317 are the largest and tend to be steady, and the current is the optimal value. The optimal values for the two data sets used are 60 and 65, respectively.

Result and Discussion
The prediction results of our experiments by Jackknife on the data sets ZD98 and CH317 are listed in Tables 3, 4, and 5. To further illustrate the effectiveness of our method, the prediction results in each subcellular location of two data sets are also listed in Tables 3-5, which are sensitivity, specificity, correlation coefficient, and overall accuracy, respectively.
It can be seen from Table 3 that the method has obtained good experimental results on both two data sets, and the total accuracies rates are 96.7% and 94.8%, respectively. The experiment proves that the method can effectively increase the accuracy of the prediction of protein subcellular localization. At the same time, in order to facilitate the comparison with other methods, we have listed some experimental results based on some improved algorithms of protein sequence feature extraction in the past several years.
In Tables 4 and 5, DCC SVM comes from Liang [40], by using detrended cross-correlation coefficient(2016); OF SVM comes from Zhang [41], by using -Order Factor and principal component analysis(2017); DE SVM comes from Liang [42], by fusing two different descriptors based on evolutionary information(2018); BOW SVM comes from Zhao [24], by using bag of words(2017); GA SVM comes from Liang [17], by using geary autocorrelation and DCCA coefficient(2017); OA SVM comes from Zhang [43], by using oversampling and pseudo amino acid composition(2018); IAC SVM comes from Zhang [44], by using integrating auto-cross correlation and PSSM(2018); EI SVM comes from Xiang [45], by using evolutionary information(2017); CF SVM comes from Chen [46], by using a set of discrete sequence correlation factors(2015); all the methods use SVM as the final classifier.
It can be seen from Table 4 that the result on the data set ZD98 has a maximum improvement of the overall prediction accuracy, increasing by about 6 to 8 percentage points compared with traditional protein sequence feature extraction algorithms such as DCC SVM, OF SVM, and DE SVM. In the subcellular class of cytoplasmic proteins, the prediction accuracy rate is 100%, which means that all the sequences in this class are predicted correctly, and the overall prediction accuracy is better than other methods as well. Compared the experimental results with other improved feature extraction algorithms such as BOW SVM, GA SVM, and OA SVM, the accuracy on the same data set is also improved by about 3 to 5 percentage points. Experiments show that the proposed method indeed provides a better source of information for protein sequences and have significant advantages than other similar feature extraction methods. From the comparison in Table 5, we can see that the prediction result on mitochondrial proteins of data set CH317 is up to 96.4%, which is about 4.1% to 14% higher than other algorithms. The accuracy rate in the class of Nuclear has also increased by 14.2% maximally, improving the total prediction accuracy by 3.3 to 4.3 percentage points compared with the improved algorithms such as IAC SVM, EI SVM, and CF SVM, which further demonstrates that the method can optimize the underlying features of the sequence and effectively improve the prediction accuracy of apoptosis protein subcellular localization. Compared with the traditional protein sequence feature extraction and their improved methods, the time complexity of our algorithm is not only low but can also achieve better results based on the simple AAC feature. The background information of the feature representation can also be extracted by mean pooling and comprehensively reflect the distribution of sequence features more, as well as improving the classification accuracy.

Conclusions
Prediction of apoptosis protein subcellular localization has always been the hotspot of bioinformaticians all over the world. Based on the traditional protein sequence feature extraction algorithm AAC, this paper introduced sparse coding to optimize sequence features and proposed a feature fusion method based on multilevel dictionary. The main contribution includes firstly using sliding window segmentation to extract the sequence fragments of protein sequences, and the traditional feature extraction algorithm was used to encode them. Then the K-SVD algorithm was used to learn the dictionary, and the sequence feature matrix was sparsely represented by the OMP algorithm. The feature representation based on different sizes of dictionaries is mean-pooled to help extract the overall and local feature information. Finally the SVM multiclass classifier is used to predict the subcellular location of the proteins. Experiments show that the proposed method can obtain better results in the prediction success rate of most subcellular classes and have important guiding significance for improving the feature expression of traditional apoptosis protein sequence feature extraction algorithms. Generally speaking, it is a relatively effective method for predicting the subcellular localization of apoptosis proteins.

Data Availability
The data used to support the findings of this study is available from the corresponding author upon request, and you can also find it from https://github.com/Multisc/Multi sc subloc.