A New Method for Recognizing Cytokines Based on Feature Combination and a Support Vector Machine Classifier

Research on cytokine recognition is of great significance in the medical field because cytokines benefit the diagnosis and treatment of diseases, but current methods for cytokine recognition have many shortcomings, such as low sensitivity and low F-score. Therefore, this paper proposes a new method based on feature combination. The features are extracted from the compositions of amino acids, physicochemical properties, secondary structures, and evolutionary information, and an SVM is used as the classifier. Experiments show that our method outperforms other methods in terms of accuracy, sensitivity, specificity, F-score and Matthew's correlation coefficient.


Introduction
Cytokines are mainly proteins or peptides generally associated with inflammation and cell differentiation, including interleukins (ILs), interferons (IFNs), etc. They can contribute to the diagnosis and treatment of diseases [1,2]; for example, they are conducive to the treatment of hematopoietic dysfunction [3], tumors [4], infection [5,6] and inflammation [7,8]. With the arrival of the post-genomic era, the number of new proteins has increased dramatically. Identifying cytokines among these new proteins requires complex biological and immunological experiments, which are time-consuming and expensive, whereas the primary structures of these proteins are much easier to obtain. We therefore propose a method that can recognize cytokines quickly and accurately. It is helpful for medical scientists; for instance, it can reduce the experimental range and workload, since they can selectively test the functions of the proteins identified as cytokines by this method and skip tests for functions unrelated to cytokines. Generally speaking, the proposed method has two important steps, i.e., feature extraction and classifier selection.
Many algorithms exist for extracting features from protein sequences; several are introduced below. Nakashima et al. [9] first proposed an algorithm that obtains 20-dimensional features from a protein sequence based on amino acid composition (AAC for short). Luo et al. [10] later extended it to 400 dimensions using dipeptide composition information. Based on the physicochemical properties of proteins, Shen et al. proposed the PseAAC algorithm [11]. Several algorithms have also been proposed based on position-specific score matrices (PSSMs), which can be computed with the PSI-BLAST software [12]. Additionally, features [13,14] can be obtained from structure probability matrices (SPMs) and secondary structure sequences (SSSs), which can be extracted with the PSI-PRED software [15]. A method based on local discriminant bases has recently been applied to protein classification [16], and an algorithm combining PSSM and PseAAC has been proposed for protein prediction [17].
There are many classifiers for recognizing protein function, such as Rough Set [18], Naive Bayes [19] and SVM [20], of which SVM is particularly effective. As a supervised learning model, SVM has been widely used in many domains thanks to its simplicity and efficiency, such as protein subcellular prediction [21], HIV-1 and HIV-2 protein prediction [22], gene selection [23], protein subcellular localization [24], pre-microRNA prediction [25] and membrane protein function prediction. However, the data used have a strong effect on the performance of SVM.
So far, a considerable amount of literature has been published on the prediction of cytokines. Huang et al. used SVM and 7-fold cross-validation to predict and classify the cytokine superfamily [26]. Zeng et al. improved cytokine prediction with an n-gram algorithm and a genetic algorithm [27]. Jiang et al. enhanced cytokine identification by choosing appropriate features and classifiers [28]. The advantages of these methods are high accuracy and specificity; however, they also have shortcomings such as low sensitivity and low F-score. For example, the best accuracy in the experiments of Jiang et al. is 93%, but the corresponding sensitivity is only 51%.
In this paper, we aimed to improve the performance of cytokine recognition and the efficiency of the SVM classifier when the numbers of positive and negative sequences are imbalanced. Compared with the feature extraction methods in the papers of Huang et al. and Zeng et al., we comprehensively considered four aspects of protein sequences to build a stable model. Meanwhile, compared with the current method, we improved a feature extraction method and adopted it in our model. We carried out extensive experiments to find the optimal feature combination and the hyper-parameters of the SVM, and we give suggestions for further improvement. We initially obtained 450 features from composition information and physicochemical properties. We then extracted structure probability matrices, position-specific score matrices and secondary structure sequences, and selected 418 features from them as a vector. Finally, we selected 448 of the 868 features (450 plus 418) as the optimal features and used SVM as our classifier. Experiments with 10-fold cross-validation and an independent test set show that our method is superior to other approaches.

Classifier and Verification Methods
SVM is a suitable classifier for protein prediction and classification, such as the prediction of protein secondary structure [29], the prediction of protein interactions [30], protein structure classification [31] and the prediction of protein families [32]. SVM is especially well suited to binary classification, which matches our problem, so we chose it to classify protein sequences. It constructs a linear separating hyperplane in a high-dimensional feature space to distinguish positive from negative sequences. Computing a hyperplane in high-dimensional space is computationally expensive, but SVM reduces this cost by means of kernel functions. We used the linear kernel and the Gaussian kernel in our experiments, together with the LIBSVM software [33]. 10-fold cross-validation was used to evaluate the generalization ability of our method: all sequences are randomly divided into ten groups, one group is used as the test set and the rest as the training set, and each of the ten groups is tested exactly once, so the method is tested ten times per experiment and the average of the ten results is taken as the final evaluation. Besides, we also prepared an unbiased independent test set, randomly selected from the total sample and accounting for 20% of it. In this test set, the ratio of positives to negatives is 1:9, the same as in the training set. The CD-HIT software ensures that the similarity of each sequence between the training set and the test set is less than 0.6. We have uploaded the independent test set to GitHub (https://github.com/DeveloperMrYang/FCSVM).
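As a minimal sketch of this evaluation protocol (using scikit-learn's SVC rather than the LIBSVM command-line tools, and assuming a feature matrix X and labels y have already been built; the placeholder data below stand in for the real features):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Hypothetical inputs: X holds one 448-D feature vector per sequence,
# y holds 1 for cytokines (positives) and 0 for negatives.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 448))       # placeholder features
y = rng.integers(0, 2, size=200)      # placeholder labels

for kernel in ("linear", "rbf"):      # "rbf" is the Gaussian kernel
    clf = SVC(kernel=kernel)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(f"{kernel}: mean accuracy = {scores.mean():.4f}")
```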

Measurements
Acc, Sens, Spec, F-score and Mcc were used to measure the performance of our method, and they can be calculated by the following formulas. Here TP is the number of true positives, i.e., positives predicted as positives; FP is the number of false positives, i.e., negatives predicted as positives; TN is the number of true negatives, i.e., negatives predicted as negatives; and FN is the number of false negatives, i.e., positives predicted as negatives. Acc measures the accuracy of the method, i.e., the ratio of correctly predicted data among all tested data. Sens measures the sensitivity, i.e., the ratio of correctly predicted positives among all tested positives. Spec measures the specificity, i.e., the ratio of correctly predicted negatives among all tested negatives. Pre is the ratio of correctly predicted positives among all predicted positives. F-score measures the quality of the method by combining Sens and Pre. Mcc is Matthew's correlation coefficient [34]. It should be noted that Mcc is set to −1 when both TP and FP are 0, in which case the model has the poorest predictive ability:
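The formula block itself did not survive extraction; assuming the standard definitions of these measures, they are:

$$Acc = \frac{TP + TN}{TP + TN + FP + FN}, \quad Sens = \frac{TP}{TP + FN}, \quad Spec = \frac{TN}{TN + FP}$$

$$Pre = \frac{TP}{TP + FP}, \quad F\text{-}score = \frac{2 \times Pre \times Sens}{Pre + Sens}$$

$$Mcc = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$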

Performance of Feature Methods
We separately tested the performance of the four feature extraction methods described above. The classifier was SVM with either the linear kernel or the Gaussian kernel. Table 1 shows the results of each feature extraction method with the different kernel functions. In these experiments, both TP and FP are 0 when using the n-gram features with the Gaussian kernel, so that prediction is invalid and is not listed in Table 1. The results show that F_PseAAC with the Gaussian kernel performs best in Acc, F-score and Mcc. In the recognition of negative sequences, F_pssm-380 with the Gaussian kernel is best, with a Spec of 93.250%. Correspondingly, in the identification of positive sequences, F_n-gram with the linear kernel performs best, with a Sens of 85.576%.

Feature Combinations
We combined F_pssm-20 with F_pssm-380 to obtain a 400-D feature vector (F_pssm for short), and combined F_sss with F_PseAAC to obtain a 48-D feature vector (F_sp for short). F_pssm was combined with F_sp to obtain a 448-D feature vector (F_psp for short), and finally F_n-gram was combined with F_psp to obtain an 868-D feature vector (F_pspn for short). Table 2 shows the results of the feature combination methods with different kernel functions.
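These combinations are simple concatenations of the per-sequence feature blocks. A minimal sketch (assuming each block has already been computed as a NumPy array; the zero vectors below are placeholders):

```python
import numpy as np

# Hypothetical per-sequence feature blocks (dimensions from the text).
f_pssm_20  = np.zeros(20)    # column averages of the normalized PSSM
f_pssm_380 = np.zeros(380)   # PSSM correlation features
f_sss      = np.zeros(18)    # secondary structure features
f_pseaac   = np.zeros(30)    # PseAAC features
f_ngram    = np.zeros(420)   # 1-gram + 2-gram features

f_pssm = np.concatenate([f_pssm_20, f_pssm_380])   # 400-D
f_sp   = np.concatenate([f_sss, f_pseaac])         # 48-D
f_psp  = np.concatenate([f_pssm, f_sp])            # 448-D
f_pspn = np.concatenate([f_ngram, f_psp])          # 868-D
assert f_pspn.shape == (868,)
```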
From Table 2, we found that F_pspn with 868 dimensions and the linear kernel performs better than the other vectors in almost all respects. Nevertheless, F_pspn has 420 more dimensions than F_psp, while the predictive ability of F_psp with the linear kernel is only about 0.2% lower. Thus, weighing time cost against the quality of the feature combinations, we selected F_psp with 448 dimensions as our final feature vector.
A grid search and 3-fold cross-validation were used to choose the best parameters (C and γ) for F_psp with the Gaussian kernel. C and γ range from 2^−5 to 2^5 in multiplicative steps of 2. The accuracy for different C and γ is shown in Table 3. When C = 2^2 and γ = 2^−5, we obtain the optimal accuracy.
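A minimal sketch of this search with scikit-learn (again assuming X and y are prepared; LIBSVM's grid.py tool would be the equivalent in the authors' toolchain):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 448))      # placeholder 448-D features
y = rng.integers(0, 2, size=200)     # placeholder labels

param_grid = {
    "C":     [2.0 ** e for e in range(-5, 6)],   # 2^-5 ... 2^5
    "gamma": [2.0 ** e for e in range(-5, 6)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3,
                      scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```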
After parameter optimization, the new result shows that F_psp with the Gaussian kernel is better than with the linear kernel, so we conducted 10 repetitions of 10-fold cross-validation on it. The mean ± standard deviation of each measurement is shown in Table 4. In the final result, the accuracy is 90.84%, the sensitivity is 89.23%, the specificity is 92.42%, the F-score is 90.63% and the Mcc is 81.71%.

Table 3. The accuracy of different C and γ for F_psp with the Gaussian kernel.

Comparison with Other Methods
Jiang et al. [28] downloaded cytokine sequences from the Uniprot database and processed the data with the CD-HIT software. Our data acquisition and processing methods are the same, but the number and proportion of positive and negative sequences we used differ, which would affect the comparison. To eliminate this factor, we randomly divided the 9146 positive sequences into nine groups, then combined the positive sequences of each group with all negative sequences into a new dataset. In each new dataset, the number and proportion of positive and negative sequences are consistent with those of Jiang et al., and we ran new experiments on the nine datasets to find the optimal feature combination. Table 5 shows that F_psp with the linear kernel is better than the other feature combinations on almost all metrics except Spec. From all results, we finally took F_psp with 448 dimensions as our optimal feature vector.
We compared our method, i.e., the feature combination F_psp with SVM and the linear kernel, with the three methods of Jiang et al. The results are shown in Figures 1-3. Figure 1 compares our method with the method using a 473-D feature vector and SVM. Figure 2 compares our method with the method using the MRDR dimensionality reduction method and the LIBD3C classifier. Figure 3 compares our method with the method using the PCA dimensionality reduction method and the BP-NN classifier. The figures show clearly that our method outperforms theirs. On the independent test set we prepared, we also obtain high performance: the accuracy is 93.25%, which is likewise higher than their methods.

Discussion
Useful information can be extracted from large numbers of sequences by appropriate feature extraction methods. Meanwhile, the choice of classifier affects the results of cytokine recognition. To obtain higher performance, we focused on finding the optimal feature combination and tuning the hyper-parameters appropriately in our experiments, which improves the characterization of cytokines and yields a more suitable classifier.
Although we achieved higher accuracy, we have some suggestions for further improvement. Firstly, feature selection using sparse regressions [35] such as LASSO or elastic net could further improve performance and interpretability. Secondly, although SVM has been widely used in many fields, we suggest trying some deep learning methods in future experiments, such as deep neural networks for large datasets or cascaded random forests for small datasets. Finally, another sequence-based feature, split amino acid composition, could also be used in new experiments [36,37].

Conclusions
We extracted the F_n-gram features based on amino acid composition, the F_PseAAC features based on physicochemical properties, the F_pssm features based on evolutionary information and the F_sss features based on secondary structure. We then used combinations of these features, SVM and 10-fold cross-validation to find the best performing vector for cytokine recognition. Finally, we chose F_psp with 448 dimensions as our feature vector and SVM with the linear kernel as our classifier. The experiments show that our method is superior to the others: we obtain an increase of between 2.07% and 5.96% in accuracy, between 0.42% and 14.51% in sensitivity, between 0.43% and 6.52% in specificity, between 9.57% and 20.81% in F-score, and between 9.99% and 24.21% in Mcc.

Data
A total of 63,811 cytokine sequences were found and downloaded from the Uniprot database website (http://www.uniprot.org/) and used as positive sequences. We obtained all of their families, and from each of the remaining families, i.e., the protein families left after removing the families of the positives, we selected the longest sequence as a negative sequence, yielding 10,118 negative sequences in total. We then used the CD-HIT software to process the positive and negative sequences separately in order to remove highly similar sequences. To balance the numbers of positive and negative sequences, we set the positive threshold to 0.6 and the negative threshold to 0.5, obtaining 9163 positive sequences and 9327 negative sequences. We deleted the sequences that are too long for their secondary structures to be obtained, as well as the sequences shorter than 20 residues, because the extraction of some features requires a sequence length of at least 20. Finally, we obtained 9146 positive sequences and 9272 negative sequences.
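A minimal sketch of the length-filtering step (assuming the sequences are in FASTA files and redundancy reduction has already been done with CD-HIT; MAX_LEN and the file names are hypothetical):

```python
MIN_LEN = 20      # some features require length >= 20
MAX_LEN = 5000    # hypothetical cutoff for sequences too long to predict

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def length_filter(in_path, out_path):
    """Keep only sequences with MIN_LEN <= length <= MAX_LEN."""
    with open(out_path, "w") as out:
        for header, seq in read_fasta(in_path):
            if MIN_LEN <= len(seq) <= MAX_LEN:
                out.write(f"{header}\n{seq}\n")

# length_filter("positives_cdhit.fasta", "positives_filtered.fasta")
```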

Feature Extraction
In order to identify cytokines efficiently and build a stable model, we comprehensively considered four aspects of protein sequences, which are the composition of amino acids, physicochemical properties, evolutionary information of amino acids and the structure of protein sequences.
We first obtained 420 features from the composition of amino acids using the n-gram algorithm with n = 1 and n = 2, and 30 features from physicochemical properties with the PseAAC algorithm. Moreover, we computed PSSMs with the PSI-BLAST software and obtained 400 features describing the evolutionary information of the protein sequences and amino acids. Finally, we extracted secondary structure sequences (SSS for short) and structure probability matrices (SPM for short) with the PSI-PRED software to obtain 18 features describing the structure of the protein sequences. The process of feature extraction is shown in Figure 4.

n-Gram
The n-gram algorithm, with its low complexity, has been widely used in many fields [38,39]. We obtained features by counting the occurrences of n consecutive amino acids in the sequence, with n = 1 and n = 2. Given a protein sequence P, let L be the length of P and a_1a_2...a_n stand for n spatially adjacent amino acids. For every possible segment a_1a_2...a_n in P, its feature value is computed from T(a_1a_2...a_n), the number of times a_1a_2...a_n appears in P; N = 2 is the maximum value of n. Therefore, there are 20 features for 1-grams and 400 (i.e., 20^2) features for 2-grams, giving a total of 420 features from the n-gram algorithm.
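A minimal sketch of this extraction (the paper's exact feature formula did not survive extraction; the sketch assumes counts normalized by the number of possible positions, a common convention):

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

def ngram_features(seq: str, max_n: int = 2) -> list:
    """420-D n-gram vector: 20 1-gram values + 400 2-gram values,
    each count normalized by the number of possible positions."""
    features = []
    for n in range(1, max_n + 1):
        positions = max(len(seq) - n + 1, 1)
        counts = {"".join(g): 0 for g in product(AMINO_ACIDS, repeat=n)}
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            if gram in counts:          # skip non-standard residues
                counts[gram] += 1
        features.extend(c / positions for c in counts.values())
    return features

vec = ngram_features("HIDHIHI")
assert len(vec) == 420
```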

PseAAC
However, composition information alone is not sufficient for cytokine recognition. For example, the amino acid sequences HIDHIHI and HIHIDHI have the same feature vector under the n-gram algorithm, yet they are different sequences because the order of their amino acids differs. The PseAAC algorithm was proposed to solve this problem; it is based on composition information, sequence-order correlation and physicochemical properties.
For a protein sequence, a total of 20 + λ features can be extracted by the PseAAC algorithm, written as in Formula (2), where p_i is the ith feature value. The first 20 features are computed from the composition of the sequence, and the other λ features describe the sequence-order correlation; they are computed from the hydrophilicity, hydrophobicity and side-chain mass of the amino acids, as shown in Equations (9) and (10). Here λ expresses the rank of correlation among the amino acids of a protein sequence, w is the weight factor for the sequence order effect [40], and a_i represents the ith amino acid of the 20 standard amino acids in fixed order. H_1(a_i), H_2(a_i) and M(a_i) are the hydrophilicity, hydrophobicity and side-chain mass of a_i, normalized by the Z-score transformation shown in Formula (5), where µ and σ are the mean and standard deviation. f(a_i) is the frequency of a_i, also normalized by the Z-score method. The PseAAC algorithm has been implemented in the PseAAC-Builder software [41,42], which we used to obtain a 30-dimensional feature vector with λ = 10 and w = 0.05.
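Since the paper's Equations (2), (9) and (10) did not survive extraction, the sketch below follows Chou's usual PseAAC definition, which matches the description above; the property tables h1, h2 and m are assumed to be supplied already Z-score normalized:

```python
import numpy as np

def pseaac(seq, h1, h2, m, lam=10, w=0.05):
    """Standard-form PseAAC sketch (assumed, not the paper's exact formulas).

    h1, h2, m: dicts mapping each of the 20 amino acids to Z-score-normalized
    hydrophilicity, hydrophobicity and side-chain mass.
    Returns a (20 + lam)-D vector; with lam=10 this is the paper's 30-D case.
    Requires len(seq) > lam, which the length filter (>= 20) guarantees.
    """
    aas = "ACDEFGHIKLMNPQRSTVWY"
    L = len(seq)
    freqs = np.array([seq.count(a) / L for a in aas])

    def theta_pair(a, b):
        # average squared property difference between two residues
        return ((h1[a] - h1[b]) ** 2 + (h2[a] - h2[b]) ** 2
                + (m[a] - m[b]) ** 2) / 3.0

    # j-th tier correlation factor: average over all residue pairs j apart
    thetas = np.array([
        sum(theta_pair(seq[i], seq[i + j]) for i in range(L - j)) / (L - j)
        for j in range(1, lam + 1)
    ])
    denom = freqs.sum() + w * thetas.sum()
    return np.concatenate([freqs / denom, w * thetas / denom])
```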

PSSM

For a protein sequence P, the PSSM of P, which contains the evolutionary information of P, can be computed with the PSI-BLAST software. The matrix is as follows, where a_{i,j} indicates the score of the ith amino acid in P evolving to the jth of the 20 standard amino acids:

$$PSSM = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,20}\\ a_{2,1} & a_{2,2} & \cdots & a_{2,20}\\ \vdots & \vdots & \ddots & \vdots\\ a_{L,1} & a_{L,2} & \cdots & a_{L,20} \end{pmatrix}$$

Additionally, the PSSM can be used in another way [43], calculating pairwise similarity based on both the PSSM and the PSFM. Each score is converted into a value between 0 and 1 by a normalization formula, and we then calculated the average of each column of the normalized PSSM by Equation (14), obtaining 20 features; this feature vector is abbreviated F_pssm-20. These 20 features are extracted from the evolutionary information of a sequence, but they do not capture the correlation between amino acids in the sequence. To overcome this shortcoming, Zhang et al. [17] proposed an algorithm based on the PSSM and the correlation factor (b_λ) computed by the PseAAC algorithm. The features are calculated by Equation (15), where F_s and F_t are the averages of the sth and tth columns of the PSSM, respectively. Two points need to be noted. First, the sequence length should be greater than or equal to 20; otherwise L − g will be 0 and the denominator in Equation (15) will be 0. Secondly, the value of Equation (15) is 1 when s and t are equal, so the features with s = t are invalid and should be abandoned; that is, Equation (15) is computed only for s ≠ t, and the 20 features with s = t are deleted from the 400. Finally, a total of 380 features are obtained; this feature vector is abbreviated F_pssm-380.
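A minimal sketch of the F_pssm-20 part (assuming the PSSM is already loaded as an L×20 array and that the 0-1 normalization is the logistic function, a common choice; the paper's exact Equation did not survive extraction):

```python
import numpy as np

def f_pssm_20(pssm: np.ndarray) -> np.ndarray:
    """F_pssm-20: per-column averages of the normalized PSSM.

    pssm: L x 20 array of raw PSI-BLAST scores for a sequence of length L.
    """
    normalized = 1.0 / (1.0 + np.exp(-pssm))   # assumed 0-1 normalization
    return normalized.mean(axis=0)             # one average per column -> 20-D

rng = np.random.default_rng(0)
demo_pssm = rng.integers(-10, 11, size=(50, 20)).astype(float)  # placeholder
print(f_pssm_20(demo_pssm).shape)  # (20,)
```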

Secondary Structure
The secondary structure of a protein has often been used to predict protein structural classes; here we use it to recognize protein function. The secondary structures of protein sequences are divided into helix, sheet and coil, abbreviated H, E and C.
PSI-PRED software can be used to extract the secondary structure sequence of a protein sequence, i.e., a sequence consisting of H, E and C. We obtained three features by calculating the frequencies of H, E and C with Equation (11) [44], where L is the length of the secondary structure sequence and T(H), T(E), T(C) are the numbers of H, E and C:

$$F_{sss\text{-}1\text{-}3} = \left( \frac{T(H)}{L}, \frac{T(E)}{L}, \frac{T(C)}{L} \right)$$

We obtained another three features with positional information, computed by Equation (18) [44], where P^H_i, P^E_i and P^C_i are the position indices (starting at 1) of the ith H, E and C in the secondary structure sequence. Another three features can be computed by Equation (19). For a secondary structure sequence S, removing all C from S leaves a sequence S' of H and E; denoting each run of consecutive H by α and each run of consecutive E by β turns S' into a sequence of α and β [45]. A transition probability matrix (TPM) can then be computed by Equation (20) [46-50], and six features are extracted from the TPM by the following formula:

$$F_{sss\text{-}10\text{-}15} = \left\{ TPM(i,j), \frac{TPM(1,1)+TPM(2,1)}{2}, \frac{TPM(1,2)+TPM(2,2)}{2} \;\middle|\; i,j = 1,2 \right\}$$

For a secondary structure sequence S, the SPM of S is shown in Equation (22), where a_{i,j} is the probability of the ith element of S being predicted as the jth element of {C, H, E} (1 ≤ i ≤ L, 1 ≤ j ≤ 3):

$$SPM = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3}\\ a_{2,1} & a_{2,2} & a_{2,3}\\ \vdots & \vdots & \vdots\\ a_{L,1} & a_{L,2} & a_{L,3} \end{pmatrix}$$

Three further features are then obtained by Equation (23). In total, we obtain 18 features based on the secondary structure of protein sequences.

Author Contributions: Z.Y. proposed the method and designed the experiments; all authors analyzed the data; Z.Y. and J.W. wrote the paper.