Applied Soft

Support vector machines are widely used for prediction problems in the life sciences and have been shown to offer strong generalisation ability in input–output mapping. However, the performance of predictive models is often degraded by the complex, high-dimensional, and non-linear nature of post-genome data. Soft computing methods can be used to model such non-linear systems. Fuzzy systems are among the most widely used soft computing methods for modelling uncertainty; they are formed of interpretable rules that help one gain insight into the applied model. This study therefore aims to provide a more interpretable and efficient biological model by developing a hybrid method that integrates a fuzzy system with support vector regression. To demonstrate the robustness of this hybrid method, it is applied to the prediction of peptide binding affinity, one of the most challenging problems in the post-genomic era owing to the diversity of peptide families and the complexity and high dimensionality of the characteristic features of peptides. Across four case studies, the hybrid predictive model yielded the highest predictive power in all four cases and achieved an improvement of as much as 34% over the results presented in the literature. Availability: Matlab scripts are available at https://github.com/


Introduction
Peptide binding plays vital roles in the molecular biology of the cell. The process of peptide binding can activate the cytotoxic T-cells in the immune system [1]. One of the most challenging and complex aspects of peptide binding is the prediction of protein-peptide binding affinity. These bindings are crucial in that they induce cellular immune responses [2]. However, owing to the diversity of peptide families, a very large number of peptides are available and are still being discovered (e.g., potentially over 512 billion peptides for each MHC molecule [3]).
Biological experiments for measuring the binding affinity between proteins and peptides are costly and time-consuming. In this regard, computational methods are of particular interest in bioinformatics for finding feasible approaches to this problem [4,5]. Predictive models for the identification of peptide binding affinity are often used to determine whether binding exists between a peptide and an MHC molecule [6]. These qualitative models were further improved to classify binders as strong or weak [7][8][9]. Recent research efforts have focused on quantifying the binding predictions. The additive model is one of the earliest quantitative approaches proposed to model MHC-peptide binding for finding precise binding affinities [10]. Subsequent studies focused on non-linear approaches, which achieved better performance than linear models such as the additive method. A non-linear modelling approach has been taken by a number of later methods, such as regularisation methods [11], partial least squares [12] and random forests [13], to predict the real value of the binding affinity. However, the complexity and non-linearity that exist in such data sets have created the need for more robust and sophisticated methods.
Fuzzy systems are able to model uncertain and imprecise knowledge in complex and non-linear data sets, and form a structure for representing human reasoning. Among various fuzzy systems, the Takagi-Sugeno-Kang (TSK) system is commonly used for modelling complex systems [14,15]. TSK fuzzy systems (TSK-FS) can be combined with other methods, particularly learning methods, and enhanced with learning and adaptation capabilities [16]. In TSK models, the rule antecedent takes the form of membership functions and the rule consequent is a linear function of the inputs. Although many methods have been proposed to model TSK-FS, the general approach is to keep the premise parameters constant while the values of the consequent parameters are computed. This computation is done by least squares estimation (LSE), a statistical modelling technique that assumes a linear relationship between input and output variables. LSE is based on minimising the empirical risk and constitutes an essential part of TSK fuzzy systems [17,18]. One drawback of the least squares learning algorithm is that, even though the training error is minimised, the model can suffer badly from overfitting [19]. Methods have nevertheless been explored for addressing the problems in least squares estimation (e.g., neuro-fuzzy systems [17], genetic-fuzzy systems [20]).
Support vector regression (SVR) [21,22] is an efficient and robust method that provides high generalisability and performance. Applications of SVR, which minimises the structural risk, have demonstrated considerably better modelling of various non-linear systems than the least squares approach. This concept can therefore be incorporated into TSK-FS in order to better train its consequent part [23]. However, few methods have been reported in the literature that use support vector based methods in the consequent part of the fuzzy system [24][25][26].
In this paper, a support vector based Takagi-Sugeno-Kang fuzzy system (TSK-SVR) is proposed and applied to the quantitative prediction of the binding affinities between major histocompatibility complex proteins (MHCs) and peptides, an important problem in biology and medicine with applications in drug design. This paper extends the initial work [27] and improves on its results, yielding as much as 34% improvement in prediction accuracy over recently published results. The rest of the paper is organised as follows. Section 2 introduces the peptide binding affinity problem. Section 3 explains the background methodology. Section 4 presents the SVR-based TSK type-1 fuzzy prediction model. Section 5 presents the results and discussion. Finally, conclusions are drawn in Section 6.

Peptide binding affinity
This section presents the problem statement and data sets to be used.

Problem statement
A peptide presented by MHC class I molecules is a short amino acid sequence that generally contains eight to eleven amino acids [28]. Peptides bind to protein molecules in order to induce cellular immune responses. Affinity indicates the tendency or strength of the binding. As there is a very large number of peptides (potentially over 512 billion binding peptides for each MHC molecule [3]), prediction methods are needed to help determine the binding affinities of these peptides. In addition, to avoid time-consuming experimental determination, a computational predictive model should be developed. The difficulty in building a prediction model for peptide binding is that the number of features is very large (in this study ≥5000) whereas the number of peptides in the training data set is relatively small (in this study ≤150).

The data sets
The high-dimensional peptide data sets provided at the comparative evaluation of prediction algorithms (CoEPrA) modelling competition [29] were used in this study in order to further improve the prediction of peptide affinity and, in particular, to test the predictive capability of the proposed TSK-SVR model. As shown in Table 1, each task contains calibration (training) and prediction (test) data sets, and physico-chemical descriptors have been provided for each small peptide (for both calibration and prediction data sets).
Two different amino acid data sets consisting of physico-chemical and bio-chemical properties of amino acids have been used in the literature (e.g., the AAindex database [30] and CISAPS [31]). To be consistent with CoEPrA, each amino acid in a peptide is described here by 643 descriptors. It should be noted that these descriptors were picked mostly from the AAindex database. Task 2 consists of octa-peptides that have a total of 5144 (643×8 = 5144) descriptors whereas all other tasks have nona-peptides that have a total of 5787 (643×9 = 5787) descriptors (Table 1). The statistics (range, mean and standard deviation) of the binding affinities of the peptides of each task are given in Table 2.

Background methodology
The proposed approach consists of a number of components to be implemented for the prediction of peptide binding affinity. This section provides background information on these components.

Takagi-Sugeno-Kang fuzzy system
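The Takagi-Sugeno-Kang (TSK) fuzzy system rules are defined as conditional statements whose consequent part is a linear function. A fuzzy rule-base with $n$ input variables $(x_1, x_2, \ldots, x_n)$ and $r$ rules can be written as:

$$R_r:\ \text{IF } x_1 \text{ is } A_{1r} \text{ AND } x_2 \text{ is } A_{2r} \text{ AND } \ldots \text{ AND } x_n \text{ is } A_{nr} \text{ THEN } y_r \tag{1}$$

where $A_{nr}$ is a fuzzy set for the input variable $n$ and rule $r$, generally represented by a membership function, and $y_r$ is a linear function in the consequent part as expressed in:

$$y_r = m_0 + m_1 x_1 + m_2 x_2 + \cdots + m_n x_n \tag{2}$$

where $m_0, m_1, m_2, \ldots, m_n$ are the coefficients of the input variables $(x_1, x_2, \ldots, x_n)$. In the TSK model each rule generates a crisp output, and the final output is obtained by aggregating all the rule outputs. This process is called defuzzification, and the weighted average defuzzification value $y$ can be defined as:

$$y = \frac{\sum_{i=1}^{r} f_i\, y_i}{\sum_{i=1}^{r} f_i} = \sum_{i=1}^{r} \bar{f}_i\, y_i \tag{3}$$

$$\bar{f}_i = \frac{f_i}{\sum_{k=1}^{r} f_k} \tag{4}$$

where $f_i$ and $\bar{f}_i$ are the firing strength and normalised firing strength of the $i$-th fuzzy rule, respectively, and $f_i$ is determined by using a t-norm operator that can be defined as:

$$f_i = \mu(x_1) \star \mu(x_2) \star \cdots \star \mu(x_n) \tag{5}$$

where $\star$ denotes the t-norm and $\mu(x_j)$ is the membership degree of input variable $x_j$ in the corresponding fuzzy set of rule $i$. The fuzzy sets (e.g., $A_{ij}$) can be described by any form of membership function. In this study, the Gaussian membership function is used as expressed in:

$$\mu(x) = \exp\left(-\frac{(x - c)^2}{2\sigma^2}\right) \tag{6}$$

where $c$ and $\sigma$ are the centre and standard deviation, respectively.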
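
The inference steps described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation (their scripts are in Matlab): the product is assumed as the t-norm, and the function names `gaussian` and `tsk_predict` as well as the dictionary layout of a rule are hypothetical.

```python
import math

def gaussian(x, c, sigma):
    """Gaussian membership degree of x for centre c and standard deviation sigma."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def tsk_predict(x, rules):
    """Weighted-average TSK output for one input vector x.

    Each rule is a dict (hypothetical layout) with:
      'centres', 'sigmas': one Gaussian parameter pair per input variable
      'coeffs': [m0, m1, ..., mn] of the linear consequent
    """
    strengths, outputs = [], []
    for rule in rules:
        # Firing strength: product t-norm over the membership degrees (assumption).
        f = 1.0
        for xj, c, s in zip(x, rule["centres"], rule["sigmas"]):
            f *= gaussian(xj, c, s)
        strengths.append(f)
        # Crisp rule output: linear function of the inputs.
        m = rule["coeffs"]
        outputs.append(m[0] + sum(mj * xj for mj, xj in zip(m[1:], x)))
    # Normalise the firing strengths; this divides by zero if every rule
    # underflows to a firing strength of 0 for a far-away input.
    total = sum(strengths)
    return sum((f / total) * y for f, y in zip(strengths, outputs))
```

When two rules share the same antecedent they fire equally, so the output is simply the average of the two rule outputs, which is a convenient sanity check for an implementation of this kind.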

Support vector regression
Support vector machine (SVM) is a statistical learning architecture based on structural risk minimisation [32]. The SVM learning algorithm finds the optimal separating hyperplane by training a classifier on given training data. The optimal separating hyperplane is the one that maximises the margin between two classes. SVMs can be generalised to perform regression using their linear model. Instead of the traditional squared error loss function, the $\varepsilon$-insensitive loss function is used in SVR [33]. This error function tolerates errors up to $\varepsilon$. Another advantage of this error function is its tolerance to noise. SVR searches for a linear function $h(x)$:

$$h(x) = \langle w, x \rangle + b \tag{7}$$

where $w$ and $b$ are the weight vector and bias of the linear expression. This linear function is obtained by solving the following constrained optimisation problem:

$$\min_{w,\, b,\, \xi^{+},\, \xi^{-}} \ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \left(\xi_i^{+} + \xi_i^{-}\right) \tag{8}$$

subject to

$$y_i - h(x_i) \le \varepsilon + \xi_i^{+}, \quad h(x_i) - y_i \le \varepsilon + \xi_i^{-}, \quad \xi_i^{+}, \xi_i^{-} \ge 0, \quad i = 1, \ldots, N \tag{9}$$

where $(x_i, y_i)$, $i = 1, \ldots, N$, are the training samples and the two types of slack variables $\xi^{+}$ and $\xi^{-}$ measure the deviations of training samples outside the $\varepsilon$-region [22]. The values of these variables are computed during the training of SVR as in (9). The parameter $C$ is a pre-specified value that acts as a regularisation factor, trading off the minimisation of $\lVert w \rVert$ against the extent to which deviations greater than $\varepsilon$ are tolerated. Certain training instances are chosen to be support vectors. The weighted sum of the support vectors is then used to define the regression function and adequately model the data.
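
To make the role of the loss function and of the parameter C concrete, the following sketch evaluates the ε-insensitive loss and the corresponding regularised objective for a given linear model. It is an illustration only, not a solver: it relies on the standard fact that, at the optimum, the two slack variables of a sample sum to its ε-insensitive loss. The function names are hypothetical.

```python
def eps_insensitive_loss(y_true, y_pred, eps):
    """Zero inside the eps-tube, linear in the excess deviation outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)

def svr_objective(w, b, X, y, C, eps):
    """Regularised SVR objective 0.5*||w||^2 + C * sum of eps-insensitive losses."""
    reg = 0.5 * sum(wi * wi for wi in w)
    loss = 0.0
    for x, yi in zip(X, y):
        prediction = sum(wi * xi for wi, xi in zip(w, x)) + b
        loss += eps_insensitive_loss(yi, prediction, eps)
    return reg + C * loss
```

A larger C penalises deviations beyond ε more heavily relative to the norm of w, whereas a larger ε widens the tube within which errors cost nothing.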

SVR-based TSK type-1 fuzzy prediction model
This section presents the implementation of the SVR-based TSK type-1 fuzzy prediction model. The flowchart of the proposed model is shown in Fig. 1.

Preprocessing
The amino acids of the peptides that form the data set were turned into numerical descriptors using amino acid indices. The analysis then started with normalising the data set so that every feature falls within the same range of values. The descriptors are normalised to the interval [0, 1] as expressed in (10):

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{10}$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the descriptor, respectively.
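
A minimal sketch of this normalisation step is shown below, scaling each descriptor (column) to [0, 1] independently. The function name `min_max_normalise` is hypothetical, and mapping a constant descriptor to 0 is an assumption made here to avoid division by zero; the source does not say how such columns are treated.

```python
def min_max_normalise(rows):
    """Scale every column of a list-of-lists data matrix to the interval [0, 1]."""
    n_cols = len(rows[0])
    lows = [min(row[j] for row in rows) for j in range(n_cols)]
    highs = [max(row[j] for row in rows) for j in range(n_cols)]
    scaled = []
    for row in rows:
        scaled.append([
            # Constant columns carry no information; map them to 0 (assumption).
            (row[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_cols)
        ])
    return scaled
```

In practice one would usually compute the minima and maxima on the calibration set and reuse them for the prediction set, so that both sets are scaled consistently.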

Reducing the high dimensionality
Feature selection is a process that reduces dimensionality by choosing a subset of relevant features, leading to better performance of the system or model. In this regard, feature selection algorithms are widely used in bioinformatics with the aim of finding the smallest number of features that improves the accuracy and performance of the models [34]. Several feature selection methods are available. In this study, the problem of feature selection is addressed by utilising multi-cluster feature selection (MCFS) [35], as its superiority has recently been shown across different application domains [36][37][38]. MCFS is an unsupervised feature selection method that uses the information contained in eigenvectors, obtained by solving a generalised eigen-problem, to preserve the multi-cluster structure of the data. In this study, the MCFS parameter specifying the number of eigenvectors used is set to the number of features to be selected.

Identifying antecedent parameters
The fuzzy c-means (FCM) method partitions a data set into a number of clusters in such a way that each data object is assigned a degree of membership for each cluster [39]. The FCM model aims to minimise an objective function. The clustering process iteratively calculates the cluster centres and the degrees of membership of each data point until the convergence criterion on the objective function is satisfied or the number of iterations reaches a preset value.
Clustering-based methods have commonly been used to construct the rule-base and membership functions automatically for rule-based fuzzy systems, in particular for type-1 fuzzy systems [40,41]. The fuzzy sets involved in the rules are fully characterised by their membership functions. As explained in Section 3.1, the Gaussian membership function was utilised to develop the fuzzy rule base. The centroids of the clusters and their corresponding standard deviations obtained from FCM are used to determine the values of the parameters of the Gaussian membership functions. In this study, the degree of fuzzification is chosen to be two for FCM, and the number of clusters determines the number of rules.
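
The following sketch shows how FCM could supply the antecedent parameters. It is a simplified illustration with several assumptions: the initial centres are passed in explicitly to keep the example deterministic (the study initialises FCM randomly), convergence is checked on the movement of the centres, and the per-cluster standard deviation is taken as a membership-weighted estimate, which is one plausible choice that the source does not spell out. All function names are hypothetical.

```python
import math

def memberships(data, centres, m=2.0):
    """Membership degree of every point in every cluster (each row sums to 1)."""
    exponent = 2.0 / (m - 1.0)
    table = []
    for x in data:
        # Clamp distances so a point lying exactly on a centre cannot divide by zero.
        d = [max(math.dist(x, c), 1e-12) for c in centres]
        table.append([1.0 / sum((di / dk) ** exponent for dk in d) for di in d])
    return table

def fcm(data, centres, m=2.0, max_iter=100, tol=1e-6):
    """Alternate membership and centre updates until the centres stop moving."""
    for _ in range(max_iter):
        u = memberships(data, centres, m)
        new_centres = []
        for i in range(len(centres)):
            w = [row[i] ** m for row in u]
            total = sum(w)
            new_centres.append([sum(wk * x[j] for wk, x in zip(w, data)) / total
                                for j in range(len(data[0]))])
        shift = max(math.dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:
            break
    return centres, memberships(data, centres, m)

def cluster_sigmas(data, centres, u, m=2.0):
    """Membership-weighted standard deviation per cluster and feature (assumed estimator)."""
    sigmas = []
    for i, c in enumerate(centres):
        w = [row[i] ** m for row in u]
        total = sum(w)
        sigmas.append([math.sqrt(sum(wk * (x[j] - c[j]) ** 2
                                     for wk, x in zip(w, data)) / total)
                       for j in range(len(c))])
    return sigmas
```

Each resulting centre and standard deviation pair then parameterises one Gaussian membership function per input variable, and each cluster corresponds to one fuzzy rule.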

Identifying consequent parameters
Least squares estimation is a common method used to compute the values of the consequent parameters of TSK-FS [18]. Support vector regression with a linear kernel can potentially be used instead to compute these values. The variables $(\bar{f}_i, \bar{f}_i x_{i1}, \bar{f}_i x_{i2}, \ldots, \bar{f}_i x_{in})$, defined using the normalised firing strength in (4), form the inputs to SVR, which derives the $w$ parameters that correspond to the consequent parameters in TSK-FS. Finally, the SVR-based TSK-FS can be formulated by combining (11) and (12), where $y'$ represents the output of the SVR-based TSK-FS. For the sake of simplicity, the LIBSVM library [42] was used to implement the support vector regression part.
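
One way to read this construction is sketched below: for a single peptide with inputs x_1, ..., x_n, each rule contributes its normalised firing strength and that strength multiplied by every input, and a linear model over the concatenated vector plays the role of the consequent parameters. The function names and the ordering of the entries are assumptions made for illustration; the weights w and bias b would come from a linear-kernel SVR trained on such vectors.

```python
def svr_inputs(x, normalised_strengths):
    """Build the SVR input vector for one sample.

    For every rule i with normalised firing strength fb_i, append
    (fb_i, fb_i * x_1, ..., fb_i * x_n); the result has r * (n + 1) entries.
    """
    z = []
    for fb in normalised_strengths:
        z.append(fb)
        z.extend(fb * xj for xj in x)
    return z

def tsk_svr_output(z, w, b):
    """Linear SVR decision function applied to the rule-weighted inputs."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b
```

With r rules and n selected features the SVR therefore works on r(n + 1) inputs, so the number of rules directly scales the size of the regression problem.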

Searching the optimal parameters
Three important parameters are likely to affect the performance of the models: $C$ and $\varepsilon$, which are used to optimise the linear-kernel SVR part, and the number of rules (i.e., clusters) for the TSK-FS. Because no generally accepted method exists to determine these parameters optimally, the grid-search method is employed for parameter selection in order to find the optimal parameter set. The grid-search method is simple and reliable and lends itself to parallel computation. The parameter range is searched with a step size of 0.05 to find the optimal parameters of the linear-kernel SVR. For the features, the search range is between 1 and 250. It is hoped that these ranges are broad enough to contain the optimum. Accordingly, these parameters as well as different combinations of the features are assessed and the results are presented. Fig. 2 depicts how the grid-search is conducted on the SVR parameters ($C$ and $\varepsilon$) over their given ranges.
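
The search described here amounts to an exhaustive loop over the candidate values, as in the sketch below. The `score` callable is a hypothetical placeholder that would train a TSK-SVR model for the given setting and return its validation performance. The 0.05 step and the range 0.05 to 3.00 for C follow the text and the caption of Fig. 2; the range used for ε in the usage example is an assumption, since the source does not state it.

```python
def frange(start, stop, step):
    """Inclusive list of evenly spaced values, rounded to suppress float drift."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + k * step, 10) for k in range(count)]

def grid_search(score, c_values, eps_values, rule_counts):
    """Return (best_score, C, eps, n_rules) over the full grid.

    `score(C, eps, n_rules)` is a user-supplied callable (hypothetical here)
    where a higher value means a better model.
    """
    best = None
    for n_rules in rule_counts:
        for c in c_values:
            for eps in eps_values:
                s = score(c, eps, n_rules)
                if best is None or s > best[0]:
                    best = (s, c, eps, n_rules)
    return best
```

A call such as `grid_search(score, frange(0.05, 3.0, 0.05), frange(0.05, 1.0, 0.05), range(2, 8))` would cover 60 values of C, 20 values of ε and two to seven rules. Because every grid point is evaluated independently, the loops can be distributed across processes without changing the result.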

Performance measurements of the prediction models
Different measures are used to assess the capability of predictive models. However, in order to maintain consistency with the published results and allow a consistent comparison, the coefficient of determination ($q^2$) and the Spearman rank correlation coefficient ($\rho$) are used, which can be expressed as:

$$q^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_{exp,i} - y_{prd,i}\right)^2}{\sum_{i=1}^{n} \left(y_{exp,i} - \bar{y}_{exp}\right)^2}$$

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

where $y_{exp}$ and $y_{prd}$ are the expected and predicted values of the peptide binding affinity, respectively, $n$ is the number of peptides, $\bar{y}_{exp}$ is the mean of all expected values in the data set, and $d_i$ is the difference between the ranks of $y_{exp,i}$ and $y_{prd,i}$. The measure $q^2$ is a statistical measure based on the proportion of variability in a data set that is accounted for by the model [43]. A $q^2$ value close to 1 suggests that the model has been successfully constructed. Negative $q^2$ values indicate that the model approximates the expected values poorly. The Spearman rank correlation coefficient ($\rho$) [44] is used to measure the statistical dependence between two variables. The value of $\rho$ ranges between +1 and −1, indicating perfect positive or negative correlation at the respective end. The measures are calculated for each task (both training and testing). The metric $q^2$ is used to assess the performance of the predictive models for the first three tasks, whereas the fourth task was assessed by $\rho$ in the competition.
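
Both measures are straightforward to compute, as the sketch below illustrates. Here q² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the expected values, and the Spearman coefficient is computed as the Pearson correlation between ranks. Assigning average ranks to tied values is an assumption of this sketch; the source does not state how ties are handled. Function names are hypothetical.

```python
import math

def q2(y_exp, y_prd):
    """Coefficient of determination of predictions against expected values."""
    mean = sum(y_exp) / len(y_exp)
    ss_res = sum((e - p) ** 2 for e, p in zip(y_exp, y_prd))
    ss_tot = sum((e - mean) ** 2 for e in y_exp)
    return 1.0 - ss_res / ss_tot

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(y_exp, y_prd):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    a, b = ranks(y_exp), ranks(y_prd)
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)
```

A model that always predicts the mean of the expected values obtains q² = 0, which explains why negative values signal a model that does worse than this trivial baseline.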

Experimental results and discussion
The results of the experiments are discussed in three sub-sections. In the first part, the robustness of the proposed hybrid TSK-SVR method is demonstrated over the four data sets. In the second part, SVR and TSK-SVR are compared in order to demonstrate the performance with and without the fuzzy concept. The last part presents the outcome of the feature selection, showing amino acid locations and amino acid scales.

TSK-SVR results
In this section, the results of the proposed model (support vector based fuzzy system) are presented. To test the performance of the proposed model, four peptide data sets obtained from the CoEPrA competition are used. The proposed approach addresses a predictive problem with a very large number of attributes, rather than simulated or practical data sets that have only a small number of features and consist of noise-free samples.
The data sets used each contain over 5000 descriptors for each peptide. One difficulty in the analysis of post-genome data is the curse of dimensionality, a term usually related to the significant challenges that may occur when working with high-dimensional data sets [45]. Small sample size is another important characteristic of the peptide data sets. As a consequence, the high-dimensional nature of the data negatively affects the performance of prediction methods, and the proposed approach is no exception. Since thousands of features are available for peptides, a feature selection process is integrated into the proposed model as an initial step to obtain a low-dimensional feature space. MCFS was able to deal efficiently with the large number of attributes of the peptide data sets, and the reduced feature subset was used as the input variables of the rule-based fuzzy system.
FCM is used in this study to construct the fuzzy rule-base. The centroids of the clusters and their corresponding standard deviations obtained from FCM are used to design the Gaussian membership functions of the fuzzy models. The rule-base for fuzzy systems (in this study, the TSK fuzzy system) can be derived by using clustering methods where each cluster generally represents a fuzzy rule. Therefore, the number of clusters is equivalent to the number of rules in the fuzzy system. The optimum number of clusters (and consequently the number of rules) can generally be determined experimentally: different numbers of clusters are explored, and the cluster structure that yields the best outcome (e.g., minimum error) is regarded as the best set of clusters. Following this concept, we studied numbers of clusters from 2 to 7. It should be noted that the cluster centres and the membership matrix are randomly initialised in the fuzzy clustering stage. Random initialisation in FCM may therefore have some effect on the performance.
Along with the number of rules (clusters), further experiments were carried out to find the optimum values of the parameters of the TSK-SVR model for each rule structure, as demonstrated in Figs. 4-9. In addition, one common problem in support vector based approaches is that it is not easy to determine which kernel function should be used [46]. In this study, SVR is trained with a linear kernel to learn the parameters of the consequent part of the fuzzy model. Therefore, the parameters $C$ and $\varepsilon$ need to be optimised. The optimisation of the SVR parameters was achieved by grid-search, in which several thousand parameter values were tested over each rule base in order to find the set of values that yields the highest $q^2$ (first three tasks) or $\rho$ (last task) for that rule base. The grid-search is repeated for each feature selection step and, at the end of the process, the best model is selected.
For Task 1, the graphs fluctuate and reach local maxima, particularly within the first 100 features. They then rise gradually and reach the global maximum at 161 features, after which they become steady. For Task 2, the graphs increase gradually as the number of selected features grows. They reach local maxima within the first 75 features and the global maximum at 247 features (at 246 features in Fig. 4). For Task 3, slight fluctuations are observed throughout the graphs, which reach local maxima within the first 150 features and the global maximum at 165 features (at 172 features in Fig. 5). For Task 4, substantial fluctuations are observed throughout the graphs, which reach local maxima after 50 features and the global maximum at 141 features (at 121 features in Fig. 9). As an example, the fuzzy rules for Task 4 (with two rules only) are illustrated in Fig. 3. As this model has 141 features and cannot fit in the paper, only three features are presented. The rules should be read using IF before each parameter's membership function, AND between the parameters, and finally THEN before the consequent part.
For each rule-base the proposed method is able to build a robust and interpretable fuzzy system for a high-dimensional data set with a relatively small number of data samples. It is observed that an optimum predictive model for each task was obtained by using different sets of rules as presented in Table 3. While the number of rules needed was smaller for Tasks 2, 3 and 4, Task 1 seems to require more rules to obtain the best possible outcome. Only two rules for Task 4 and three rules for Tasks 2 and 3 were enough to yield the best performance, whereas a fuzzy rule-base with six or seven rules seems to be required for the optimum modelling of Task 1.

Table 3
Prediction results of TSK-SVR for each rule-base. For each number of rules, two results are presented. The former shows the best result obtained with the lowest possible number of features as compared to the literature. The latter shows the best result and its number of features.
Columns: Number of rules, Task 1, Task 2, Task 3, Task 4.

The outcomes of the experiments clearly highlight the strengths of TSK-SVR. The fuzziness has contributed positively to the modelling of the tasks. To illustrate the performance of the proposed hybrid method, it is compared with recently published results. In the CoEPrA competition, Tasks 1 and 2 each had more than ten participants and Tasks 3 and 4 each had more than five. As shown in Table 4, the results outperform the competition results, in which each participant competed with their best model (e.g., SVR, RF, PLS) [29]. In addition, for each task the results obtained are better than those of the recent studies presented in [11,29,12] and [13]. Compared to the best model presented in the literature, the predictive performance for Tasks 1, 2, 3 and 4 has been improved by 0.7%, 11.2%, 33.6% and 9.7%, respectively. The overall improvement gain for all tasks is found to be 13.8%.
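
The overall figure is consistent with the arithmetic mean of the four per-task improvements, which the short check below reproduces. This reading is an inference from the reported numbers rather than a statement made in the source, and the function name is hypothetical.

```python
def mean_improvement(per_task_gains):
    """Arithmetic mean of the per-task improvement percentages."""
    return sum(per_task_gains) / len(per_task_gains)

# Per-task improvements over the best literature model, as reported in the text.
gains = {"Task 1": 0.7, "Task 2": 11.2, "Task 3": 33.6, "Task 4": 9.7}
overall = round(mean_improvement(list(gains.values())), 1)
```

The average is dominated by Task 3, so it is worth reading alongside the individual values rather than in place of them.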

Comparison of SVR and TSK-SVR
A number of studies have presented the prediction of peptide binding affinity by using SVR-based analysis. As TSK-SVR is a hybrid method that combines SVR with a fuzzy rule-base, namely TSK in this study, it is important to compare the performance of SVR with and without the fuzzy concept. As detailed in Table 5, the comparison with recent literature in which SVR has been used for the same data sets with the same training and test cases gives clear evidence over all the tasks that the proposed TSK-SVR algorithm outperforms its solo version and yields an improvement of 2.1%, 16.3%, 33.6% and 13.8% for each of the tasks, respectively. This outcome demonstrates the superiority of the proposed hybrid approach in mapping the input onto the output in this challenging high-dimensional regression problem. The optimal parameters of TSK-SVR for the peptide binding affinity tasks are found to be $C = 2.40$, $\varepsilon = 0.05$, and a rule size of seven for Task 1; $C = 1.90$, $\varepsilon = 0.10$, and a rule size of three for Task 2; $C = 1.45$, $\varepsilon = 0.90$, and a rule size of three for Task 3; and $C = 2.30$, $\varepsilon = 0.45$, and a rule size of two for Task 4. The TSK-SVR models contained 161, 247, 172 and 141 features for the four peptide binding affinity tasks, respectively. It is worth noting that our approach (TSK-SVR) not only benefited from SVR-based training but also handled the uncertainties in the peptide binding data sets using fuzzy modelling.

Table 4
Prediction results of TSK-SVR compared to the results found in the literature. The performance of the method along with its selected number of features (f) is presented. The number of features is marked as not available for methods that do not report it.

Table 5
The parameters and correlation coefficient results of SVR and TSK-SVR.

Analysis of selected descriptors
The SVR-based experiments were carried out for four different peptide affinity data sets. For each rule-base (with between two and seven rules), feature selection (between 1 and 250 features) was conducted to reduce the number of features. The amino acid features that contributed most to the efficiency of the proposed models are given in Tables 6-9.
For Task 1, eight amino acid features contributed to the output in more than three separate locations. The amino acid feature numbered 481 (hydrophobicity coefficient in reversed phase high performance liquid chromatography) contributed most, as it is represented in seven separate locations of each nona-peptide within the data set.
For Task 2, eleven amino acid features contributed to the output in more than four separate locations. The amino acid feature numbered 364 (Zimm-Bragg parameter sigma × 1.0E4) contributed most, as it is represented in seven separate locations of each octa-peptide within the data set.
For Task 3, nineteen amino acid features contributed to the output in more than two separate locations. The amino acid features numbered 110 (composition), 338 (relative preference value at C"), 376 (relative population of conformational state A) and 405 (normalized positional residue frequency at helix termini N") contributed most, as they are represented in four separate locations of each nona-peptide within the data set.
For Task 4, ten amino acid features contributed to the output in more than two separate locations. The amino acid features numbered 306 (average relative fractional occurrence in A0(i − 1)) and 338 (relative preference value at C") contributed most, as they are represented in four separate locations of each nona-peptide within the data set.
The amino acid feature numbered 400 (polarity) appeared in Task 1, Task 2 and Task 3 as a common feature with location occurrences of 4, 6 and 3, respectively. The polarity of an amino acid is therefore considered one of the highly discriminating features in these data sets. The results also appear to suggest that different sets of amino acid descriptors affect the result, and that further exploration of feature selection methods may help improve the predictive power of the proposed hybrid method.

Conclusions
In this paper, a hybrid system (TSK-SVR) that significantly improves the predictive ability of TSK-FS with the aid of a support vector based method was developed and demonstrated through successful application to the prediction of peptide binding affinity, which is regarded as one of the difficult modelling problems in bioinformatics. As far as the algorithmic approach is concerned, two important conclusions can be drawn:
• SVR is enhanced by adding the fuzziness concept.
• TSK-FS benefits from SVR-based training.
Predictive performance has been improved by as much as 34% when compared to the best performance presented in the literature. The overall improvement gain for all tasks is found to be 13.8%. Apart from improving the prediction accuracy, this study has also identified the amino acid features "Polarity," "Hydrophobicity coefficient," and "Zimm-Bragg parameter" as highly discriminating features in the peptide binding affinity data sets. These amino acid features may therefore be considered for the better design of peptides with appropriate binding affinity.
The developed hybrid framework for non-linear system modelling is based on the TSK fuzzy model, the consequent part of which is formed by a set of linear equations. As the support vectors in SVR were used to help form the consequent part of the model, the framework can be extended to a type-2 fuzzy system with a closed-form type reduction and defuzzification method, where the Biglarbegian-Melek-Mendel (BMM) based type-2 fuzzy system could be explored as an example [47,48]. Similarly, the concept could be further generalised to explore type-n fuzzy systems, for which the defuzzification phase could be performed using such an approach. Further research is being carried out in this direction.

Fig. 1. Flowchart of the TSK-SVR model for the prediction of peptide binding affinity.

Fig. 2. Illustration of grid-search for TSK-SVR to find the optimum values of the parameters ($C$ and $\varepsilon$) and the prediction performance ($q^2$ or $\rho$) based on the selected features for the peptide binding affinity Tasks 1-4. For simplicity, the grid-search iterates the $C$ parameter with a step size of 0.05 in the range 0.05-3.00 while $\varepsilon$ remains fixed. (a) Task 1 with seven rules and 161 features yielded a $q^2$ value of 0.696 with SVR parameters $C = 2.4$ and $\varepsilon = 0.05$. (b) Task 2 with three rules and 247 features yielded a $q^2$ value of 0.743 with SVR parameters $C = 1.9$ and $\varepsilon = 0.10$. (c) Task 3 with three rules and 172 features yielded a $q^2$ value of 0.310 with SVR parameters $C = 1.45$ and $\varepsilon = 0.90$. (d) Task 4 with two rules and 141 features yielded a $\rho$ value of 0.643 with SVR parameters $C = 2.3$ and $\varepsilon = 0.45$.

Fig. 3. Illustration of the TSK-SVR model and its parameters for Task 4 with two rules. As this model has 141 features and cannot fit in the paper, only the first three normalised features are presented. The rules should be read using IF before each feature's membership function, AND between the features, and finally THEN before the consequent part. The Gaussian membership functions are specified by two parameters (i.e., [standard deviation, mean]). The number of support vectors that determine the coefficients of the linear expression ($w$ and $b$) is 79.

Table 1
General characteristics of the data sets used for the prediction of peptide binding affinity.

Table 2
The statistical characteristics of the values of peptide binding affinities.

Table 9
Top most frequent amino acid features for the optimal model of Task 4.