SeqSVM: A Sequence-Based Support Vector Machine Method for Identifying Antioxidant Proteins

Antioxidant proteins can be beneficial in disease prevention, and their functionality has received increasing attention. Identifying antioxidant proteins is therefore important. In this work, we propose a computational method, called SeqSVM, for predicting antioxidant proteins from their primary sequence features. Redundant features are removed by the max relevance max distance (MRMD) method. The antioxidant proteins are then identified by a support vector machine (SVM). The experimental results demonstrate that our method performs better than existing methods, with an overall accuracy of 89.46%. Although the proposed computational method attains an encouraging classification result, its predictions should still be verified by biochemical approaches, such as wet biochemistry and molecular biology techniques.


Introduction
Permeability is an intrinsic property of a normal cell membrane. Water and oxygen flow freely into the cell, and carbon dioxide and other waste products (uric acid, water, etc.) can also pass through the cell membrane. Free radicals arise from metabolic processes, X-rays, air pollutants, cigarette smoking, etc. [1]. They remain unstable until they find atoms for neutralization. Because the skin is exposed to external damage every day, free radicals are harmful to skin cells: they can start a chain of oxidative damage that ultimately destroys the cells.
Antioxidant proteins can neutralize free radicals and make them stable. Research shows that antioxidant proteins play an important role in terminating cellular and DNA damage caused by free radicals [2]. The damage caused by free radicals is a source of aging and various diseases [3][4][5]. Thus, research on antioxidant proteins has received increasing attention recently.
Although some micronutrients, such as vitamin E and vitamin C, have been recognized as antioxidant molecules, it is still necessary to identify effective proteins with antioxidative characteristics. Unfortunately, identifying antioxidant proteins by biochemical experiments is time-consuming. Computational prediction methods have received increasing attention recently; one example is SNPdryad, which predicts deleterious non-synonymous human SNPs (single nucleotide polymorphisms) [6,7]. Computational methods for identifying antioxidant proteins are desirable, especially when large amounts of protein sequence data are involved. A method based on star graph topological indices was proposed to handle the problem [4], and the results are encouraging. However, the sequences in [4] are reused in the experiments, which the results are ...
(1) A computational method called SeqSVM is proposed to predict antioxidant proteins. It is based on the primary sequence features proposed in [17]. The features describe the physicochemical properties and sequence information of the protein. The dimensionality of the extracted features is 188, so the feature set used here is called 188D.
(2) There is redundancy in the 188D feature set. In this manuscript, the features are selected by the maximum relevance maximum distance method [28]. Features that maximize the Pearson's correlation coefficient and the distance between attributes are kept. The experimental results show that the method using the selected features is competitive with, or even better than, the method using 188D.
The experiments demonstrate that our proposed method performs better than existing methods, with an accuracy of 89.46%. The best result of existing work is 74.79%, reported by Lin et al. [9].
The rest of the paper is organized as follows. The experimental results are discussed and analyzed in Section 2. Section 3 introduces the dataset used in the proposed work, the classification method, SMOTE processing, sequence representation, and performance evaluation. Finally, a conclusion is given in Section 4.

Comparison with Existing Methods
Our proposed method (SeqSVM) is compared with existing methods. Table 1 compares the accuracy of our method with that of the existing methods. In SeqSVM, the dataset is processed by the SMOTE method to balance the antioxidant and non-antioxidant samples. To remove feature redundancy, the features are selected by the max relevance max distance (MRMD) principle. In Table 1, the accuracy of our method with SMOTE processing and MRMD is 89.46%. A naive Bayes method was proposed to predict antioxidant proteins, and its accuracy is 66.88% in the jackknife test [8]. AodPred [9] is a method based on an SVM classifier using g-gap dipeptide features, and its accuracy is 74.79% in the jackknife test. Thus, the experimental results demonstrate that our method attains high accuracy and classifies antioxidant and non-antioxidant proteins efficiently. The time complexity of the method is dominated by the SVM classifier and depends on the number of training samples and the feature dimension.

The Comparison of Performance Evaluation on Feature Selection Methods
To further demonstrate the performance of our sequence-based method and the selected 132D features, the features are compared with g-gap dipeptides using other classifiers provided by WEKA [29]. The 188D feature set is reduced to 132D by the MRMD method, a feature selection method described in Section 3.6. The sensitivity (Sn), specificity (Sp), and accuracy (Acc) of the features on different classifiers are compared in Figures 1-3. In Figures 1-3, "Logistic" is short for logistic regression, J48 tree is a decision tree method based on C4.5, and RF and SVM are short for random forest and support vector machine.
The Sn of 132D with Bayes net is better than that of most other methods. In the experiments, our method (188D and 132D using SVM) performs better than the other classifiers using g-gap dipeptides, except SVM. However, Bayes net using 188D attains the highest Sn, 81.6%. The Sn of the reduced 132D on Bayes net is also better than that of AodPred. Figure 1 also shows that 188D and 132D perform better than g-gap dipeptides on most classifiers, which means that 188D and 132D are more robust than g-gap dipeptides. The figure further shows that the reduced 132D removes redundancy and attains comparably high sensitivity on Bayes net and J48 tree. The sensitivity of the reduced 132D features is higher than that of 188D on the other three classifiers. Thus, it is necessary to select features by the max relevance max distance method.
The specificity of the features on different classifiers is compared in Figure 2. Our method (188D with SVM) performs better than AodPred (g-gap dipeptides) on specificity. The Sp of the reduced SeqSVM is also higher than that of AodPred (g-gap dipeptides). G-gap dipeptides perform better on Bayes net than 188D and 132D. The values of Sp using different features are comparable on the Logistic, J48 tree, and RF classifiers.
In Figure 3, the accuracy of SeqSVM with 188D and 132D is better than that of AodPred (g-gap dipeptides with SVM).

The Comparison of SeqSVM
SeqSVM with SMOTE is compared to SeqSVM without SMOTE in Table 2. The accuracy of SeqSVM is 85.98% before SMOTE and 88.68% after SMOTE processing, a relative improvement of 3.1% (2.70 percentage points). The accuracy of SeqSVM with SMOTE and MRMD is 89.46%, a relative improvement of 4% (3.48 percentage points) over SeqSVM without SMOTE. The experimental results demonstrate that the performance of the classifier can be improved by SMOTE processing when the class sizes are imbalanced. Although computational methods can attain an encouraging classification result, the predictions should still be verified by biochemical approaches, such as wet biochemistry and molecular biology techniques.
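The improvement figures quoted above are relative gains rather than differences in percentage points. The short sketch below recomputes both from the accuracies given in the text; the helper name `improvement` is introduced here for illustration only.

```python
def improvement(before, after):
    """Return (gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative

# Accuracies (in %) quoted in the text.
acc_plain = 85.98        # SeqSVM without SMOTE
acc_smote = 88.68        # SeqSVM with SMOTE
acc_smote_mrmd = 89.46   # SeqSVM with SMOTE and MRMD

for label, acc in [("SMOTE", acc_smote), ("SMOTE + MRMD", acc_smote_mrmd)]:
    points, percent = improvement(acc_plain, acc)
    print(f"{label}: +{points:.2f} points, +{percent:.1f}% relative")
```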

Benchmark Dataset
The dataset used in our work was generated and used by Feng et al. [8,30,31], with data selected from the UniProt database. To ensure that the data are valid, only proteins with confirmed antioxidative activity are selected, and proteins containing ambiguous residue codes (such as "B", "X", "Z") are excluded. The benchmark dataset (S) consists of a positive subset (S+) and a negative subset (S−), formulated as Equation (1).

$$S = S^{+} \cup S^{-} \qquad (1)$$

where the symbol "∪" means the union in set theory. There are 710 antioxidant proteins and 1567 non-antioxidant proteins left after the selection process. The selected sequences still contain redundancy in the form of highly similar sequences. To avoid overestimating the performance of the methods, homologous sequences with more than 60% similarity are removed from the dataset by the CD-HIT program [32].
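The filtering and union steps can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names and the toy sequences are hypothetical, and the CD-HIT redundancy removal is an external step that is not reproduced here.

```python
AMBIGUOUS = set("BXZ")  # ambiguous residue codes named in the text

def is_unambiguous(sequence):
    """True if the sequence contains none of the ambiguous codes."""
    return not (set(sequence.upper()) & AMBIGUOUS)

def build_benchmark(positives, negatives):
    """Filter both subsets and return (S_plus, S_minus, S = S_plus | S_minus)."""
    s_plus = {s for s in positives if is_unambiguous(s)}
    s_minus = {s for s in negatives if is_unambiguous(s)}
    return s_plus, s_minus, s_plus | s_minus

# Toy sequences, invented for illustration only.
s_plus, s_minus, s = build_benchmark(
    positives=["MKVLAAGI", "MXKLV"],
    negatives=["GSHMTTPE", "ACDBEF"],
)
print(len(s_plus), len(s_minus), len(s))
```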

Support Vector Machine
Support vector machine (SVM) is a supervised classification model. SVM has been widely used in bioinformatics [9,33-46], so we introduce it only briefly here. In linearly separable cases, the key idea of SVM is to build a hyperplane that separates the two groups with a maximum margin. If the samples are not linearly separable, the input variables are mapped into a high-dimensional feature space by a kernel function. The principle of SVM is introduced in [47,48], and more details are provided in [49]. The SVM used in our work is the LIBSVM package written by Chang and Lin [50]. The radial basis function (RBF) kernel is selected because of its effectiveness and efficiency. The regularization parameter, C, and the kernel width parameter, γ, are optimized by grid search.
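To make the two tuned quantities concrete, the sketch below evaluates the RBF kernel and enumerates an exponentially spaced grid of (C, γ) candidates. The grid ranges are an assumption made here for illustration; the text does not state which ranges were searched, and the function names are not part of LIBSVM.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def parameter_grid(log2_c=range(-5, 16, 2), log2_gamma=range(-15, 4, 2)):
    """Candidate (C, gamma) pairs on an exponential grid (assumed ranges)."""
    return [(2.0 ** c, 2.0 ** g) for c in log2_c for g in log2_gamma]

grid = parameter_grid()
print(len(grid))                                        # number of candidates
print(rbf_kernel([1.0, 0.0], [1.0, 0.0], gamma=0.5))    # identical points
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))    # more distant points
```

In a grid search, each candidate pair would be scored by cross-validation and the best-scoring pair kept.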

SMOTE Processing
There are 253 antioxidant proteins and 1552 non-antioxidant proteins in the dataset. The dataset is quite imbalanced because the positive and negative samples are not equally represented. SMOTE [51] is an approach that achieves a better result by oversampling the minority class and undersampling the majority class. The key idea of SMOTE is that the minority class is oversampled by creating synthetic samples, rather than by oversampling with replacement. After SMOTE, the minority class consists of the original minority class samples and the synthetic samples. The synthetic samples are generated along the line segments joining a minority sample to any or all of its K nearest minority class neighbors. If 200% oversampling is required, two of the K nearest neighbors are chosen, and one sample is generated in the direction of each chosen neighbor. The data are standardized after SMOTE processing.
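The interpolation step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: the function name, the default K, and the toy points are hypothetical, and undersampling of the majority class and standardization are omitted.

```python
import random

def smote(minority, n_per_sample=2, k=5, seed=0):
    """Create n_per_sample synthetic points per minority sample.

    Each synthetic point lies on the segment between a sample and one of
    its k nearest minority-class neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbours = others[:k]
        for _ in range(n_per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment, in [0, 1)
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

# Toy minority class, invented for illustration only.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_per_sample=2, k=3)
print(len(new_points))  # 200% oversampling: two synthetic points per sample
```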

Sequence Representation
The 188D vector was used to extract protein features by Cai et al. in 2003 [17]. It covers amino acid composition, distribution, and physicochemical properties. Because amino acids are diverse, these properties are divided into four classes so that the features can be extracted clearly. C1 is the percentage of each amino acid (based on the amino acid class), and C2 is the percentage of amino acids based on physicochemical property. There are 20 amino acids, so the amino acid frequencies contribute 20 dimensions. The physicochemical properties are represented by eight attributes: secondary structure, solvent accessibility, normalized Van der Waals volume, hydrophobicity, charge, polarity, polarizability, and surface tension. There are three values for each attribute; for example, the secondary structure attribute can be described by EALMQKRH, VIYCWFT, or GNPSD. The values are denoted by $R_{ij}$ ($1 \le i \le 8$, $1 \le j \le 3$). The physicochemical properties of proteins are shown in Figure 4. Thus, 24 attributes are used to describe the physicochemical properties. B describes the percent frequency of bivalents. Three types of bivalent are used for each property, denoted by $R_{im}R_{in}$, $R_{im}R_{io}$, $R_{in}R_{io}$ ($1 \le m, n, o \le 3$). Thus, the bivalents contribute 24 dimensions over the eight physicochemical property attributes.
Given a protein sequence of length L, the percentage of the sequence at which the first, 25%, 50%, 75%, and 100% of the amino acids with a particular property are located is measured as the distribution of the protein. The distributions are represented by 120 attributes, because there are five values for each of the 24 physicochemical attributes. In total, the number of attributes for protein representation is 20 + 24 + 24 + 120 = 188. Not all of the 188 features need to be used for prediction, because there is redundancy between the features. Thus, the features are selected by the max relevance max distance method proposed by Zou [28].
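The dimension count can be checked with a small sketch that computes the 20 composition values plus the 21 values (3 group percentages, 3 bivalent frequencies, 15 distribution values) for a single attribute. It uses the secondary-structure grouping quoted in the text; the exact rounding used for the distribution positions is an assumption, the function names are hypothetical, and the other seven attribute groupings are not reproduced.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Three groups for the secondary-structure attribute, as listed in the text.
SECONDARY_STRUCTURE = ("EALMQKRH", "VIYCWFT", "GNPSD")

def composition(seq):
    """20 values: fraction of each amino acid in the sequence."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def ctd(seq, groups):
    """21 values for one attribute: 3 group fractions, 3 bivalent
    (transition) fractions, and 15 distribution values (5 x 3 groups)."""
    labels = [next(g for g, members in enumerate(groups) if r in members)
              for r in seq]
    n = len(labels)
    comp = [labels.count(g) / n for g in range(3)]
    pairs = list(zip(labels, labels[1:]))
    trans = [sum(1 for a, b in pairs if {a, b} == {g, h}) / max(len(pairs), 1)
             for g, h in ((0, 1), (0, 2), (1, 2))]
    dist = []
    for g in range(3):
        positions = [i + 1 for i, lab in enumerate(labels) if lab == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                dist.append(0.0)
                continue
            # Occurrence that marks this fraction of the group's residues.
            idx = max(int(round(frac * len(positions))) - 1, 0)
            dist.append(positions[idx] / n)
    return comp + trans + dist

seq = "MKVLAAGIVGNPSDEALW"  # toy sequence, invented for illustration
features = composition(seq) + ctd(seq, SECONDARY_STRUCTURE)
print(len(features))   # 20 + 21 for a single attribute
print(20 + 8 * 21)     # all eight attributes give the 188 dimensions
```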

Performance Evaluation
Sensitivity (Sn), specificity (Sp), and accuracy (Acc) are used to measure the classification quality. Sensitivity, as used in Chou's work [52][53][54][55], reflects how many of the antioxidant proteins are correctly recognized, and is calculated by Equation (2). Specificity reflects how many of the non-antioxidant proteins are correctly recognized, that is, not misclassified as antioxidant proteins. The calculation of Sp is shown as Equation (3).
Assessments of Sp or Sn, individually, are not sufficient to evaluate the performance of a method. The overall accuracy is calculated by Equation (4).
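If $N^{+}_{-} = 0$, all antioxidant proteins are recognized, and the sensitivity Sn = 1. Similarly, if $N^{-}_{+} = 0$, none of the non-antioxidant proteins are misclassified as antioxidant proteins, and the specificity Sp = 1. Equations (9)-(11) can be rewritten in terms of these counts as

$$\mathrm{Sn} = 1 - \frac{N^{+}_{-}}{N^{+}}, \qquad \mathrm{Sp} = 1 - \frac{N^{-}_{+}}{N^{-}}, \qquad \mathrm{Acc} = 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}$$

where $N^{+}$ and $N^{-}$ are the total numbers of antioxidant and non-antioxidant proteins, $N^{+}_{-}$ is the number of antioxidant proteins misclassified as non-antioxidant, and $N^{-}_{+}$ is the number of non-antioxidant proteins misclassified as antioxidant. From these equations, it is clear that if $N^{-}_{+} = N^{+}_{-} = 0$, none of the antioxidant or non-antioxidant proteins are misclassified, and Sn = Sp = Acc = 1. The larger the values of Sn, Sp, and Acc, the better the performance of the method.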
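The limiting cases discussed above can be checked numerically. The sketch assumes the standard forms of the three measures, written once with the N counts used in the text and once with confusion-matrix counts; the function names and the example counts are hypothetical.

```python
def chou_metrics(n_pos, n_neg, n_pos_missed, n_neg_missed):
    """Sn, Sp and Acc from the counts used in the text.

    n_pos        -- number of antioxidant proteins
    n_neg        -- number of non-antioxidant proteins
    n_pos_missed -- antioxidant proteins predicted as non-antioxidant
    n_neg_missed -- non-antioxidant proteins predicted as antioxidant
    """
    sn = 1.0 - n_pos_missed / n_pos
    sp = 1.0 - n_neg_missed / n_neg
    acc = 1.0 - (n_pos_missed + n_neg_missed) / (n_pos + n_neg)
    return sn, sp, acc

def confusion_metrics(tp, fn, tn, fp):
    """The same measures written with confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / (tp + fn + tn + fp)

# Invented counts, for illustration only.
print(chou_metrics(n_pos=100, n_neg=200, n_pos_missed=20, n_neg_missed=10))
print(confusion_metrics(tp=80, fn=20, tn=190, fp=10))
print(chou_metrics(100, 200, 0, 0))  # no errors: Sn = Sp = Acc = 1
```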
In the experiments, the predictors are evaluated by jackknife cross-validation [56]. Three validation methods are commonly used in the literature: the independent dataset test, K-fold cross-validation (e.g., 5-fold or 10-fold cross-validation), and the jackknife cross-validation test [56]. The jackknife test is considered the least arbitrary and most objective [57], because it gives a unique output for a given benchmark dataset.
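The jackknife procedure can be sketched as a leave-one-out loop. To keep the example self-contained, a nearest-centroid rule stands in for the SVM; that substitution, the function names, and the toy data are assumptions made for illustration.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in set(train_y):
        members = [v for v, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def jackknife_accuracy(xs, ys, predict):
    """Leave each sample out once, train on the rest, test on the one left out."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        correct += predict(train_x, train_y, xs[i]) == ys[i]
    return correct / len(xs)

# Toy data, invented for illustration only.
xs = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
ys = [0, 0, 0, 1, 1, 1]
print(jackknife_accuracy(xs, ys, nearest_centroid_predict))
```

Because every sample is left out exactly once and no random split is involved, repeated runs on the same dataset give the same result.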
In this work, we use maximum relevance maximum distance (MRMD) [28] to remove redundant features. The objective function of MRMD is shown as Equation (12). If m − 1 features have been selected, the i-th feature is selected as the m-th feature if it maximizes Equation (12).

$$\max_{i}\left(MR_{i} + MD_{i}\right) \qquad (12)$$

where $MR_i$ is the relevance of the i-th feature. The relevance is measured by the Pearson's correlation coefficient, shown as Equation (13).
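$$PCC(X, Y) = \frac{\sum_{k=1}^{N}(x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{N}(x_k - \bar{x})^2}\,\sqrt{\sum_{k=1}^{N}(y_k - \bar{y})^2}} \qquad (13)$$

where N is the number of vectors, $x_k$ ($y_k$) is the value on the k-th dimension, and $\bar{x}$ ($\bar{y}$) is the average value. $MD_i$ is used to measure the level of similarity between two feature vectors. In our experiments, the maximum distance is calculated as the mean of the Euclidean distance (ED), cosine distance (COS), and Tanimoto coefficient (TC) (shown as Equation (16)). For two feature vectors X and Y, the three measures take their standard forms:

$$ED(X, Y) = \sqrt{\sum_{k=1}^{N}(x_k - y_k)^2}, \qquad COS(X, Y) = \frac{X \cdot Y}{\lVert X \rVert\,\lVert Y \rVert}, \qquad TC(X, Y) = \frac{X \cdot Y}{\lVert X \rVert^2 + \lVert Y \rVert^2 - X \cdot Y}$$

With M denoting the number of features, the distance is calculated on each dimension, and the feature with the maximum distance is selected by satisfying the condition of Equation (17).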
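The scoring idea can be sketched as follows. This is a reading of the description above, not the reference MRMD implementation: it assumes that relevance is the absolute Pearson correlation between a feature and the class label, that the three measures are averaged with equal weight over all other features, and that the two terms are simply added. All function names and the toy data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx, ny = math.sqrt(sum(a * a for a in x)), math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def tanimoto(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom else 0.0

def mrmd_scores(features, labels):
    """Score each feature column as MR_i + MD_i (assumed combination)."""
    scores = []
    for i, f in enumerate(features):
        mr = abs(pearson(f, labels))
        others = [g for j, g in enumerate(features) if j != i]
        md = sum((euclidean(f, g) + cosine(f, g) + tanimoto(f, g)) / 3.0
                 for g in others) / len(others)
        scores.append(mr + md)
    return scores

# Toy feature columns (one list per feature) and labels, invented for illustration.
features = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.1], [4.0, 1.0, 3.0, 2.0]]
labels = [0, 0, 1, 1]
scores = mrmd_scores(features, labels)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Features would then be added in ranked order until the chosen subset size (132 in this work) is reached.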

Conclusions
Antioxidant proteins can terminate the cellular and DNA damage caused by external sources, such as exposure to X-rays, ozone, cigarette smoking, and others. The study of antioxidant proteins has drawn attention in recent years. Computational methods have been proposed to identify antioxidant proteins, and the results are encouraging. In our work, a method based on primary sequence information and SVM is proposed to predict antioxidant proteins, and the experimental results show that its classification accuracy is better than that of existing methods. Since publicly accessible web servers have been provided for practical models [62][63][64][65][66], a web server for identifying antioxidant proteins based on our method will be developed later to help researchers identify antioxidant proteins. In future work, we will also extend our method to other organisms in the UniProt database, such as E. coli, S. cerevisiae, and D. radiodurans.
Author Contributions: L.X. initially drafted the manuscript and did most of the coding and the experiments. C.R.L. collected the features and analyzed the experiments. S.S. and G.L. revised the draft manuscript. All authors read and approved the final manuscript.