High-Throughput Identification of Mammalian Secreted Proteins Using Species-Specific Scheme and Application to Human Proteome

Secreted proteins are widespread in living organisms and cells. Because secreted proteins are easy to detect in body fluids, urine, and saliva in clinical diagnosis, they play important roles as biomarkers for disease diagnosis and in vaccine production. In this study, we propose a novel predictor for accurate high-throughput identification of mammalian secreted proteins that is based on sequence-derived features. We combine amino acid composition, sequence motifs, and physicochemical properties to encode the collected proteins. Detailed feature analyses support the effectiveness of the considered features. Based on the differences between secreted proteins of various species, we introduce a species-specific scheme, which is expected to further explore the intrinsic attributes of specific secreted proteins. Experiments on benchmark datasets demonstrate the effectiveness of the proposed method, and the test on an independent testing dataset indicates good generalization capability. Compared with the traditional universal model, we experimentally demonstrate that the species-specific scheme significantly improves the prediction performance. We use our method to make predictions on the unreviewed human proteome and find 272 potential secreted proteins with probabilities higher than 99%. A user-friendly web server, named iMSP (identification of Mammalian Secreted Proteins), which implements our proposed method, is freely available for academic use at: http://www.inforstation.com/webservers/iMSP/.


Introduction
Secreted proteins (SPs) are proteins that are released by a cell or tissue into the extracellular space. Generally, these proteins are produced through two pathways, namely the classical Endoplasmic Reticulum and Golgi route [1] and the non-classical secretory routes [2]. Secreted proteins play important roles in living organisms. According to their functions, they can be divided into many categories, which include hormones [3], cytokines [4], enzymes [5], toxins [6], and antibiotics [5]. In humans, the liver is the most important secretory organ. It produces a large number of plasma proteins, such as albumin, fibrinogen, and transferrin. Secreted proteins are easy to detect in body fluids, urine, and saliva in clinical trials [7], which makes them a rich source of biomarkers and drug targets. Since the majority of blood diagnostic tests are directed towards secreted proteins, the significance of this class of proteins is frequently emphasized.
Recent years have witnessed a number of computation-based approaches in this field. In 2011, Hong et al. used physicochemical properties and amino acid composition features to predict whether a [...]
We further illustrate the distribution of physicochemical properties in Figure 2. Midline, box boundaries, and whiskers indicate median, quartiles, and 10th and 90th percentiles. The x-axis indicates the normalized values and the y-axis stands for twelve properties. For instance, the distributions for secreted and non-secreted proteins differ clearly in hydrophobicity (Panel A). This pattern is consistent in SPs-H, SPs-M, SPs-B, and SPs-C. In SPs-H and SPs-B, a significant difference is found in the distributions of the entropy of formulation and protein kinase A. In SPs-M, the difference in protein kinase A is mild, but that in polarity is remarkable. In SPs-C and SPs-O, the majority of the considered physicochemical attributes differ substantially between secreted and non-secreted proteins.
[...] polarity (PCP. [2]), solvation free energy (PCP. [3]), graph shape index (PCP. [4]), transfer free energy (PCP. [5]), correlation coefficient in regression analysis (PCP. [6]), residue accessible surface area (PCP. [7]), partition coefficient (PCP. [8]), entropy of formulation (PCP. [9]) and protein kinase A (PCP. [10]), respectively.
Table 1 lists the top 20 informative motifs calculated in the various datasets. We find that 'L'-rich (leucine-rich) MTFs are highly favored in SPs-all and SPs-H (exemplified by Figure 3). Extracellular leucine-rich pattern domains have been shown to be key organizers of connectivity during the development of neural circuits [22]. They also regulate axon guidance, target selection, synapse formation, and the stabilization of connections [23]. 'L'-rich MTFs in different secondary structures usually indicate different structural functions. As shown in Figure 3, 'L'-rich MTFs can be located in the intrinsically disordered region (Figure 3A, 'LLLL' motif), the middle of the coil (Figure 3B, 'LAL-L' motif), and the edge of the helix (Figure 3C, 'L-LLA' motif). For instance, 'L'-rich MTFs in α-helices often show pronounced curvature, while those in β-strands usually exhibit effective binding interactions [24]. Since 'L'-rich MTFs form an efficient structure, they are capable of regulating intercellular communication and cell adhesion, which may explain why they are most favored in secreted proteins [24]. 'G'-rich motifs are prevalent in SPs-M, SPs-C, and SPs-O. These observations are consistent with those for amino acid composition. However, although 'C' residues are depleted in secreted proteins, 'C' plays an important role in the composition of MTFs, and we find that 'C'-rich motifs are enriched in secreted proteins of various species. The enrichment of 'L' and 'G' might be a reason for this phenomenon. More detailed information on these MTFs is provided in Table S1. The physicochemical index data for the twenty standard amino acids are listed in Table S2.

The Performance of the Extracted Features
In Section 2.1, we analyze the differences between secreted and non-secreted proteins of various species on the considered features. However, it is still unknown whether these features can be used to distinguish secreted proteins from non-secreted proteins. Here, we test these features on the general SPs-all dataset and the five species-specific datasets. Table 2 shows the prediction performance of the different features on the training datasets over five-fold cross-validation. Overall, the AAC, MTF, and PCP features produce promising results on the general mammalian secreted protein dataset and the five species-specific datasets. In detail, AAC-based features perform the best among the three types of features, with the highest Matthews Correlation Coefficient (MCC) and AUC values on SPs-all. Although MTF-based features do not achieve the highest prediction performance, they are characterized by a high capability of recognizing non-secreted proteins (Specificity > 0.84). For Mammalia, B. taurus, and C. lupus familiaris secreted proteins, the MTF-based features yield a specificity above 0.9. Compared with AAC- and MTF-based features, PCP-based features produce similar results on the six training datasets.

The Performance of Feature Selection Scheme
We empirically demonstrate the prediction capability of the proposed features in Section 2.2. In this section, we combine the three types of features to construct the feature space. To account for redundant features, we first use the Fisher-Markov Selector [25] to calculate the coefficients between each feature and the labels. The ranked feature lists are provided in Figure S1. Next, we iteratively add features to the feature subset according to the incremental feature selection strategy. Table 3 shows the prediction results based on the optimal feature subsets. The performance for different numbers of features is provided in Table S3.

Comparison of Species-Specific Models with Traditional Universal Ones
Based on the preceding analysis, secreted proteins of different species show different attributes in many aspects. This inspires us to introduce a species-specific strategy for the identification of various mammalian secreted proteins. In contrast to universal models, species-specific models are based on species-specific feature construction and optimal feature subsets. To investigate the effectiveness of this strategy, we compare the two kinds of models on the same benchmark training datasets over five-fold cross-validation. As shown in Table 4, the species-specific models all achieve higher (2~11%) prediction accuracy. The improvements are more pronounced in sensitivity (3~18%) for all species except M. musculus. In terms of MCC, which balances sensitivity and specificity, the species-specific models all produce higher values. Figure 4 displays the AUCs of the species-specific and universal models. The grey bars indicate the species-specific models, while the black ones represent the universal models.

Comparison with Other Predictors on Independent Testing Datasets
To evaluate the generalization capability of the proposed predictor and to compare it with previous methods, we further test our method on the independent testing dataset. Recent years have witnessed several powerful predictors for the identification of SPs, such as SecretomeP, NClassG+, and SRTpred. The criteria used for selecting methods for comparison are that (1) the outputs of the predictor are scores and (2) the predictor can successfully process a protein sequence of average length (200 residues) within 30 min. As a result, two predictors available as of January 2018, namely SecretomeP [26] and SRTpred [27], are selected. Table 5 lists the prediction results of the considered predictors on the various testing datasets. The predicted values of SecretomeP and SRTpred are obtained directly from their software. All of the predictors achieve good performance on the universal and the various species-specific datasets. Our universal module (iMSP-U) produces MCC values of 0.427, 0.455, 0.507, 0.359, 0.324, and 0.332 on the six testing datasets, respectively. On the first four testing datasets, iMSP-U outperforms SecretomeP and SRTpred. On the SPs-C and SPs-O testing sets, SecretomeP and SRTpred perform much better than iMSP-U. When the species-specific models (iMSP-H, iMSP-M, iMSP-B, iMSP-C, and iMSP-O) are applied to the corresponding species-specific testing datasets, the prediction performance improves noticeably.

Application to Predict Secreted Proteins from Human Proteome by Using iMSP
We implement the proposed method as a public web server, named iMSP, which is deployed at http://www.inforstation.com/webservers/iMSP/. iMSP offers efficient high-throughput predictions for biologists. In this work, our newly compiled benchmark dataset was generated from UniProt (http://www.uniprot.org/, accessed on 1 January 2018). In the UniProt database, sequence similarity search programs are used to identify orthologs. Since H. sapiens and M. musculus secreted proteins account for a large part of all secreted proteins, they would somewhat influence the secreted proteins of other species, such as B. taurus, C. lupus familiaris, and O. cuniculus. In our benchmark dataset, the number of secreted proteins in SPs-H and SPs-M is much higher than that in SPs-B, SPs-C, and SPs-O. As a result, the accuracy of the latter three species-specific models will be affected by that of the first two. Users are advised to choose the universal model for Bos, Canis, and Oryctolagus proteins, and the species-specific models for Homo and Mus proteins.
In this part, we use iMSP to predict potential secreted proteins from the human proteome. There are a total of 71,772 proteins in the human proteome. Among them, 20,303 are reviewed records and the remaining 51,469 are unreviewed. Using our universal model (iMSP-U) and species-specific model (iMSP-H), we calculate the probabilities of the unreviewed human proteins being secreted proteins. All of the proteins are ranked according to the predicted probabilities. Based on iMSP-H, we find that 7601 (14.77%) proteins have probabilities higher than 0.8, while a large number of proteins are not predicted to be secreted proteins (shown in Table 6). When considering the highest probabilities (≥99%), we find 272 (or 0.528%) of the 51,469 proteins to be predicted secreted proteins. Finally, we list the predicted scores for the whole unreviewed human proteome (Table S4) and for the potential SPs with the highest probabilities (Table S5).
Table 6. Predicted probabilities to be potential secreted proteins in human proteome.
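The threshold counts reported for the unreviewed proteome can be reproduced from any list of predicted probabilities with a few lines of code. The sketch below is illustrative only: the function name `count_at_or_above` and the toy probabilities are hypothetical and are not taken from Table S4. The last two lines simply recompute the percentages quoted in the text from the reported counts.

```python
def count_at_or_above(probabilities, threshold):
    """Return (count, percentage) of probabilities at or above the threshold."""
    hits = sum(1 for p in probabilities if p >= threshold)
    return hits, 100.0 * hits / len(probabilities)

# Toy probabilities (hypothetical values, not taken from Table S4).
toy = [0.995, 0.91, 0.85, 0.42, 0.10]
ranked = sorted(toy, reverse=True)          # rank proteins by predicted probability
n_high, pct_high = count_at_or_above(ranked, 0.8)

# Consistency check of the percentages reported in the text.
pct_08 = round(100.0 * 7601 / 51469, 2)     # proteins above 0.8
pct_99 = round(100.0 * 272 / 51469, 3)      # proteins at or above 0.99
```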

Datasets Preparation
In this study, we collect 17,209 mammalian secreted proteins and 29,479 non-secreted proteins from UniProt. We consider several prevalent species, namely Homo sapiens (H. sapiens), Mus musculus (M. musculus), Bos taurus (B. taurus), Canis lupus familiaris (C. lupus familiaris), and Oryctolagus cuniculus (O. cuniculus). The species-specific datasets are used to explore the differences between the secreted proteins of the various mammals. Next, Blastclust [28] is used to cluster these proteins with a threshold of 30%. We pick the longest protein from each cluster as the representative. For each dataset, we randomly pick four-fifths of the secreted proteins and an equal number of non-secreted proteins to build the training/cross-validation dataset. The remaining proteins are used for independent testing. Table 7 summarizes the newly compiled dataset. These datasets are freely available on the iMSP server.
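The representative selection and the balanced training/testing split described above can be sketched as follows. This is a minimal illustration, assuming the clusters have already been parsed into lists of sequences; the function names, the fixed random seed, and the toy sequences are hypothetical and do not come from the iMSP pipeline.

```python
import random

def pick_representatives(clusters):
    """Keep the longest sequence from each cluster as its representative."""
    return [max(cluster, key=len) for cluster in clusters]

def balanced_split(secreted, non_secreted, train_fraction=0.8, seed=0):
    """Put a fraction of the secreted proteins plus an equal number of
    non-secreted proteins into the training set; the rest go to testing.
    Assumes there are at least as many non-secreted as training positives."""
    rng = random.Random(seed)
    pos, neg = list(secreted), list(non_secreted)
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_train = int(len(pos) * train_fraction)
    train = [(s, 1) for s in pos[:n_train]] + [(s, 0) for s in neg[:n_train]]
    test = [(s, 1) for s in pos[n_train:]] + [(s, 0) for s in neg[n_train:]]
    return train, test

# Toy data: 10 secreted and 15 non-secreted placeholder sequences.
reps = pick_representatives([["AAA", "AAAAA"], ["CC"]])
train, test = balanced_split(["S" * (i + 1) for i in range(10)],
                             ["N" * (i + 1) for i in range(15)])
```

With these toy inputs the training set is balanced (eight secreted and eight non-secreted proteins), while the testing set keeps the remaining, imbalanced proteins.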

Amino Acid Composition-Based Features
Amino acids are the fundamental elements of proteins. The features of amino acid composition (AAC) reflect the distribution of amino acids in proteins [29,30]. AAC is widely used in predicting protein function or structure. Given a protein P, the features of AAC are defined as follows:
$$\mathrm{AAC}(P) = [f_1, f_2, \ldots, f_{20}] \tag{1}$$
where $f_{aa}$ ($aa = 1, 2, \ldots, 20$) represents the calculated frequency of each of the 20 types of amino acids in the sequence P. Then, these frequencies are normalized to the interval [−1, 1] by using:
$$f'_n = \frac{2\,\bigl(f_n - \min(f_{aa})\bigr)}{\max(f_{aa}) - \min(f_{aa})} - 1 \tag{2}$$
where $f_n$, $\max(f_{aa})$, and $\min(f_{aa})$ are the original, maximum, and minimum calculated frequencies of the amino acids.
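As an illustration of the AAC encoding, the following sketch computes the 20 frequencies of a toy sequence and rescales them to [−1, 1] by min-max scaling. It assumes that the minimum and maximum are taken over the 20 frequencies of a single protein, which the text does not state explicitly; the function names and the toy sequence are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac_features(sequence):
    """Return the frequencies of the 20 standard amino acids in a sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

def scale_to_unit_range(values):
    """Min-max scale values to the interval [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate case: all values equal
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

freqs = aac_features("MKLLLLAGLC")   # toy sequence, half of it leucine
scaled = scale_to_unit_range(freqs)
```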

Sequence Motif-Based Features
Proteins in the same family tend to share similar attributes. These attributes are usually located in the highly conserved parts of the proteins, which can be recognized by sequence patterns/motifs [31]. In this study, we adopt information theory [32] to calculate the sequence motif (MTF) features from protein sequences. Given a set S of N proteins, where N is the number of the considered proteins, we first compute its information entropy I(S). Next, we reclassify these proteins according to MTF 'M' and compute the updated information entropy, in which $P(M)$ represents the percentage of proteins containing 'M' and $P(\bar{M})$ the percentage of proteins that do not. The Information Gain (IG) produced by the introduction of MTF 'M' is the difference between the original and the updated information entropy. In real-world cases, the imbalance between the numbers of SPs and non-SPs can bias the motifs selected on the basis of IG. Considering this, we further calculate the ratio of the difference value of IG (RDI) for the target MTF 'M' from $IG_P(M)$ and $IG_N(M)$, the IG of MTF 'M' on SPs and non-SPs, and from $I_P(S)$ and $I_N(S)$, the original information entropy of secreted and non-secreted proteins. In this study, we select the top 20 informative MTFs to encode each protein. Finally, the feature of MTF is defined as:
$$\mathrm{MTF} = [M_1, M_2, \ldots, M_{20}] \tag{7}$$
where $M_n$ represents the presence or absence of the n-th motif ('1' stands for presence; '−1' for absence).
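The motif scoring and encoding can be sketched with the standard Shannon definition of information gain. This is only an illustration under stated assumptions: it uses the textbook two-class entropy, it treats '-' in a motif such as 'L-LLA' as a wildcard for any single residue, and it does not reproduce the RDI correction. All function names and toy sequences are hypothetical.

```python
import math
import re

def entropy(n_pos, n_neg):
    """Shannon entropy (in bits) of a two-class set."""
    total = n_pos + n_neg
    h = 0.0
    for n in (n_pos, n_neg):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h

def motif_regex(motif):
    # Assumption: '-' in a motif matches any single residue.
    return re.compile("".join("." if ch == "-" else re.escape(ch) for ch in motif))

def information_gain(proteins, labels, motif):
    """Entropy reduction obtained by splitting proteins on motif presence."""
    pattern = motif_regex(motif)
    has = [bool(pattern.search(seq)) for seq in proteins]
    n = len(proteins)
    base = entropy(sum(labels), n - sum(labels))
    cond = 0.0
    for flag in (True, False):
        subset = [y for y, h in zip(labels, has) if h == flag]
        if subset:
            cond += len(subset) / n * entropy(sum(subset), len(subset) - sum(subset))
    return base - cond

def encode(sequence, motifs):
    """+1 if the motif occurs in the sequence, -1 otherwise."""
    return [1 if motif_regex(m).search(sequence) else -1 for m in motifs]

# Toy data: label 1 = secreted, 0 = non-secreted.
proteins = ["MLLLLAG", "ALLLLKK", "MKTAYIA", "GGSDEKR"]
labels = [1, 1, 0, 0]
ig_llll = information_gain(proteins, labels, "LLLL")
ig_k = information_gain(proteins, labels, "K")
vector = encode("MLALGL", ["LLLL", "LAL-L"])
```

In the toy data the motif 'LLLL' separates the two classes perfectly, so its gain equals the full entropy of the set, whereas 'K' is only weakly informative.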

Physicochemical Properties-Based Features
The physicochemical properties (PCP) of residues reveal the microscopic environment of proteins, which includes protein energy, force, and dynamics [33]. For example, interfaces are often associated with hydrophobic or polar residues [33]. Graph shape can partly determine the surface of functional regions. In view of this, we collect ten popular physicochemical properties to encode the secreted proteins. These properties include hydrophobicity [34], polarity [35], solvation free energy [36], graph shape index [37], transfer free energy [38], correlation coefficient in regression analysis [39], residue accessible surface area [40], partition coefficient [41], entropy of formulation [42], and protein kinase A [43]. The index data of these properties are arranged in a matrix (Equation (8)), where $I_{m,n}$ represents the m-th index data for the n-th type of amino acid. Detailed information on these index data is provided in Supplementary Table S2. Given a protein, its total sequence can be mathematically formulated as SEQ = [$A_1$, $A_2$, ..., $A_L$], where L is the length of the protein and $A_n$ is a 20 × 1 submatrix representing the amino acid at position n ("1" for the amino acid type that occupies the position and "0" otherwise). Then, the feature of physicochemical patterns can be formulated as:
$$\mathrm{PCP} = [\mathrm{PCP}_1, \mathrm{PCP}_2, \ldots, \mathrm{PCP}_{10}] \tag{9}$$
where $\mathrm{PCP}_n$ is the average value of the n-th column in the matrix product of Equation (8) and SEQ. These elements are scaled between −1 and 1 using Equation (2).
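Multiplying an index matrix by the one-hot encoded sequence and averaging over positions is equivalent to averaging the index values of the residues in the sequence, which the sketch below does directly. The Kyte-Doolittle hydropathy scale is used here only as a stand-in for one index; it is not necessarily the hydrophobicity index cited in [34], and the function names are hypothetical. The final scaling step of Equation (2) is omitted.

```python
# Kyte-Doolittle hydropathy values, used only as a stand-in index.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_property(sequence, index):
    """Average one physicochemical index over the residues of a sequence."""
    values = [index[aa] for aa in sequence.upper() if aa in index]
    if not values:
        raise ValueError("sequence contains no residues covered by the index")
    return sum(values) / len(values)

def pcp_features(sequence, indices):
    """One averaged value per physicochemical index (ten in the paper)."""
    return [mean_property(sequence, index) for index in indices]

# Toy example with a single stand-in index instead of the ten used in the paper.
features = pcp_features("MKLLLLAGLC", [HYDROPATHY])
```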

Feature Selection Strategy
In information theory, the existence of 'bad' (noisy or irrelevant) features can degrade the classifier or lead to overfitting [44]. Therefore, it is necessary to remove the bad features before constructing a powerful model. In this study, we introduce the Fisher-Markov Selector (FMS) [25], together with the incremental feature selection (IFS) strategy, to search for the optimal feature subset. FMS uses Markov random field optimization techniques to identify the features that are most informative in describing the native labels. The IFS strategy is then adopted to build different feature subsets according to the scored feature list. For each feature subset, a classifier is built and evaluated. The classifier that achieves the highest prediction performance is chosen as the final prediction model, and the corresponding feature subset is taken as the optimal feature subset.
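The incremental feature selection loop can be sketched as follows. The sketch assumes that a ranked feature list is already available (the Fisher-Markov scoring itself is not reproduced) and that `evaluate` is a user-supplied function returning a performance score for a feature subset, for example a cross-validated MCC. The toy score table is hypothetical.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Add ranked features one at a time and keep the best-scoring prefix."""
    best_score, best_subset = float("-inf"), []
    subset = []
    for feature in ranked_features:
        subset.append(feature)
        score = evaluate(subset)          # e.g., cross-validated performance
        if score > best_score:
            best_score, best_subset = score, list(subset)
    return best_subset, best_score

# Toy evaluation: informative features add to the score, noisy ones reduce it.
toy_gain = {"f1": 0.30, "f2": 0.20, "f3": -0.05, "f4": 0.01, "f5": -0.10}

def toy_evaluate(subset):
    return sum(toy_gain[f] for f in subset)

best_subset, best_score = incremental_feature_selection(
    ["f1", "f2", "f3", "f4", "f5"], toy_evaluate)
```

In this toy run the score peaks after the first two features, so the remaining noisy features are left out of the optimal subset.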

Model Construction and Performance Evaluation
In this work, LIBSVM 3.20 [33] is used to train and optimize the prediction model. The radial basis function is adopted as the kernel function, and grid search is used to find the optimal parameters. We assess our method using two evaluation procedures, namely five-fold cross-validation and an independent test. Five-fold cross-validation is adopted to evaluate the performance of the proposed predictor on the training dataset. First, we randomly divide the training dataset into five parts. In each run, four of them are used to train a classifier, which is then tested on the held-out fold. Then, we combine the predictions from all five iterations to compute the following threshold-dependent measurements: accuracy, sensitivity, specificity, and Matthews Correlation Coefficient (MCC). They are defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{11}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{12}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{13}$$
where TP is the number of correctly recognized secreted proteins, TN is the number of correctly recognized non-secreted proteins, FP is the number of non-secreted proteins incorrectly recognized as secreted proteins, and FN is the number of secreted proteins incorrectly recognized as non-secreted proteins. Since these measurements are sensitive to the chosen threshold, we also adopt the AUC (area under the Receiver Operating Characteristic (ROC) curve), which has been shown to be a robust assessment criterion for imbalanced testing datasets [34].
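The evaluation measures can be computed from the pooled predictions with a few lines of code. The sketch below uses the standard definitions of accuracy, sensitivity, specificity, and MCC, and computes the AUC as the probability that a secreted protein receives a higher score than a non-secreted one (ties count as one half). The function names and toy numbers are hypothetical, and the SVM training itself is not shown.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and MCC from confusion counts."""
    total = tp + tn + fp + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

def auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy confusion counts and scores (hypothetical, not from Tables 2-5).
metrics = confusion_metrics(tp=80, tn=90, fp=10, fn=20)
toy_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

The pairwise AUC is quadratic in the number of proteins, which is acceptable for a sketch; a rank-based computation would be preferable for datasets of the size used here.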

Conclusions
Secreted proteins are widespread in living organisms and cells. Because they are easily detected in body fluids, urine, and saliva in clinical practice, they play important roles as potential biomarkers for disease diagnosis and in vaccine production. In this study, we present a novel high-throughput predictor for the identification of mammalian SPs from primary protein sequences. We analyze the differences between secreted and non-secreted proteins of various types using the considered features, including AAC, MTF, and PCP. Compared with the traditional universal model, the introduced species-specific scheme improves the prediction performance for the secreted proteins of the corresponding species. Tests on the independent testing dataset indicate good generalization capability of the proposed method. We also apply the proposed predictor to the unreviewed human proteome and list 272 potential secreted proteins, which are predicted with high confidence (≥99%), for further investigation by biologists.
Supplementary Materials: The following are available online at http://www.mdpi.com/1420-3049/23/6/1448/s1. Table S1: The selected 20 motifs in six datasets. Table S2: Physicochemical index data for twenty standard amino acids. Table S3: Performance of different numbers of features in six training datasets over five-fold cross-validation. Table S4: The predicted scores for all unreviewed human proteome by iMSP. Table S5: The predicted scores for potential SPs with highest probabilities by iMSP. Figure S1: Feature ranking in six training sets.