Classifying ordered-disordered proteins using linear and kernel support vector machines

Introduction: Intrinsically disordered proteins arise when the tertiary structure of a protein is deformed. Disordered proteins play an important role in DNA/RNA/protein recognition, modulation of the specificity/affinity of protein binding, molecular threading, and activation by cleavage. The aim of this study is to identify ordered and disordered proteins, which is a challenging problem in bioinformatics. Methods: Ordered and disordered proteins are classified using linear and kernel (nonlinear) support vector machines (SVM). Results: The overall accuracy rates of linear SVM and kernel SVM in identifying ordered-disordered proteins are 86.54% and 94.23%, respectively. Discussion and conclusion: Since kernel SVM gives the better discrimination, it can be inferred that it is a satisfactory method for identifying the ordered-disordered structure of proteins.


Introduction
Protein tertiary structure arises from the interactions of helices and sheets [1]. An ordered protein becomes a disordered protein when this tertiary structure starts to deform. The lack of ordered structure gives these proteins a flexible, random-coil composition, which allows them to interact physically and functionally with their target partners. Hence, disordered proteins take part in cellular regulation processes such as cell signaling, transcription, translation and chromatin remodeling [2][3][4]. In addition, several properties make disordered proteins an important part of the proteomics field: they retain their open structure when interacting with their target, they can adopt different structures with different partners, they enhance interaction rates, and they are sensitive to proteolysis. Furthermore, there is a strong relationship between disordered protein structure and many diseases. For instance, the accumulation of disordered protein structures causes synucleinopathies, neurodegenerative diseases, cardiovascular diseases, diabetes and neural diseases [5][6][7]. The ordered or disordered structure of a protein can be determined by protein crystallography, electron microscopy, nuclear magnetic resonance (NMR) spectroscopy, and small-angle X-ray and neutron scattering [8]. However, these experimental procedures are costly and time-consuming. For this reason, fast identification and classification techniques are continuously being improved to classify proteins as ordered or disordered [9][10][11][12].
One of the most widely used classification methods is the support vector machine (SVM), which is common in bioinformatics [13][14][15][16]. The SVM can be extended to nonlinear problems by means of kernels. In this study, kernel SVM is investigated for ordered-disordered proteins.
In the literature, linear SVM is also used for bioinformatics problems [17,18]. However, real-world data may not be normally distributed. Kernel SVM is a very useful method for classifying nonlinear data [19]. SVM has been used to identify positions in the amino acid sequence [20,21] and to predict protein secondary structure content [22].
In this study, linear and kernel SVM are applied to classify ordered-disordered proteins. This is an important problem in bioinformatics because of the biological significance of these proteins.

Support vector machine (SVM)
Classifying data is one of the most important tasks in machine learning, and SVM is one of the basic machine learning techniques. It is a robust method that classifies data using class labels. The idea of SVM is to construct a hyperplane between the data sets that indicates which class each data point belongs to, as shown in Figure 1.
Using labeled training data, SVMs separate the classes with a hyperplane in a high-dimensional space that has the maximum distance to the nearest training data point of any class.
As a classification technique, an SVM separates a given set of training data labeled {-1, +1} by the hyperplane that lies at the largest distance from both the positive and the negative samples [23]. Nonlinear datasets can also be classified by using the kernel method, in which the data are mapped into a nonlinear feature space F and the SVM finds the hyperplane in F.

Linear support vector machine
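Consider a training set for a two-class classification problem that can be separated linearly:

$$\{(x_i, y_i)\}_{i=1}^{n}, \quad x_i \in \mathbb{R}^d, \quad y_i \in \{-1, +1\}.$$

The optimum hyperplane satisfies the following inequalities [24]:

$$w \cdot x_i + b \geq +1 \ \text{ for } y_i = +1, \qquad w \cdot x_i + b \leq -1 \ \text{ for } y_i = -1.$$

The aim is to find the best hyperplane, that is, the one that gives the largest separation, or margin, between the two classes, as seen in Figure 2.
The points that define the hyperplane are called support vectors and satisfy:

$$w \cdot x_i + b = \pm 1.$$

The following optimization problem can be solved to maximize the margin of the optimum hyperplane:

$$\min_{w, b} \ \frac{1}{2} \|w\|^2.$$

The decision function can be described as below:

$$f(x) = \operatorname{sign}(w \cdot x + b),$$

where x belongs to the first class for $f(x) = +1$ and to the second class for $f(x) = -1$. The constraints can be expressed as follows:

$$y_i (w \cdot x_i + b) \geq 1, \quad i = 1, \dots, n.$$

The optimization problem can be solved by using Lagrange multipliers $\alpha_i \geq 0$, which gives the following equality [17]:

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right].$$

Setting the derivatives of L with respect to w and b to zero yields $w = \sum_{i=1}^{n} \alpha_i y_i x_i$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$. Consequently, the decision function can be written as follows:

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x) + b \right).$$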
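The maximum-margin idea of this section can be illustrated with a minimal sketch. It is not the Lagrangian solution described in the text: as an assumption for brevity, it minimises the soft-margin hinge loss by sub-gradient descent, and the function names, hyperparameters (`lam`, `lr`, `epochs`) and toy points are all hypothetical.

```python
# Minimal linear SVM sketch. Assumptions: soft-margin hinge loss trained by
# sub-gradient descent (not the solver used in the study); toy data only.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Return (w, b) approximately minimising lam/2*||w||^2 + hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:
                # Margin constraint violated: step toward satisfying it.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Constraint satisfied: only the regulariser shrinks w.
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Decision function f(x) = sign(w . x + b)."""
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical, linearly separable toy points with labels in {-1, +1}.
X_toy = [(-2, -1), (-1, -2), (-2, -2), (2, 1), (1, 2), (2, 2)]
y_toy = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(X_toy, y_toy)
predictions = [predict(w, b, x) for x in X_toy]
```

On this toy set every point ends up on the correct side of the learned hyperplane; for real protein data the 20-dimensional composition vectors would take the place of the 2-dimensional points.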

Kernel support vector machine
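The linear SVM was introduced by Vapnik in 1963. The kernel method was later introduced by Bernhard E. Boser, Isabelle M. Guyon and Vladimir Vapnik to solve nonlinear classification problems. The kernel method relies on the kernel trick, which provides a bridge from linearity to nonlinearity [26]. The training data X are mapped into a high-dimensional feature space F by a map $\varphi: X \to F$ associated with a kernel function $K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$. The classification rule is then transformed into the following form [27]:

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \right).$$

K is a symmetric and positive definite function. In this study, the Gaussian kernel function is used:

$$K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right).$$

The kernel function is used to determine the optimum parameters $\alpha_i$. The problem can be solved using Lagrange equations [17]:

$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad \alpha_i \geq 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$

With the optimal multipliers $\alpha_i^*$ and bias $b^*$, the decision function is found as below:

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i^* y_i \exp\left( -\frac{\|x_i - x\|^2}{2\sigma^2} \right) + b^* \right).$$

In this study, SVM was used with the Gaussian radial basis function (RBF) as the kernel function.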
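A small sketch may help to show how a Gaussian kernel turns a problem that no straight line can solve into one a kernel decision function handles. The example uses the four XOR points, which are not linearly separable. The multipliers and bias are not fitted: as an assumption, they are set by hand (all multipliers equal, zero bias) from the symmetry of the problem, and the names `gaussian_kernel`, `decision` and the value of `sigma` are illustrative.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def decision(x, support_vectors, labels, alphas, b, sigma):
    """Kernel decision function sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    score = sum(a * y * gaussian_kernel(sv, x, sigma)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if score + b >= 0 else -1

# XOR layout: opposite corners share a label, so no line separates the classes.
X_xor = [(0, 0), (1, 1), (0, 1), (1, 0)]
y_xor = [-1, -1, 1, 1]

# Hand-picked values for illustration only (assumption, not a trained model).
alphas = [1.0, 1.0, 1.0, 1.0]
b = 0.0
sigma = 0.7

xor_predictions = [decision(x, X_xor, y_xor, alphas, b, sigma) for x in X_xor]
```

In practice the multipliers would come from solving the dual optimization problem on the training set, and `sigma` would be tuned on validation data.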

Dataset
The sequences of the disordered and ordered proteins were extracted from DisProt [28]. Our dataset consists of 114 protein sequences, 57 ordered and 57 disordered. Sixty percent of the data is randomly selected for training, and the remainder is used for validation.
Each protein x in the dataset is transformed into its amino acid composition vector:

$$F(x) = \left( f_1(x), f_2(x), \dots, f_{20}(x) \right),$$

where F(x) is the vector containing the frequency of each of the 20 amino acid types [29]. The proportion of amino acid residue i in a protein x is defined by:

$$f_i(x) = \frac{n_i}{N}, \quad i = 1, \dots, 20,$$

where $n_i$ is the number of occurrences of amino acid i and N is the number of amino acid residues in the protein x [30].
Oldfield and Dunker [31] reported that the flexibility and structural instability of disordered proteins are encoded by their amino acid sequences. Xue et al. [32] concluded that the amino acid composition of disordered proteins differs from that of other proteins, including ordered ones. Amino acid sequence data is therefore a key input for the prediction of disordered proteins. Some amino acids occur more frequently in disordered proteins than in ordered proteins. In addition, a common feature of disordered proteins is compositional bias: the bias favors hydrophilic amino acids and disfavors hydrophobic residues. Thus, Arg, Gln, Glu, Lys, Pro and Ser are more abundant in disordered proteins, whereas Cys, Ile, Leu, Phe, Trp, Tyr and Val are less abundant. Accordingly, a strong relationship between amino acid composition and protein disorder was identified by Hansen et al. [33], Romero et al. [34] and Oldfield and Dunker [31].
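The composition feature described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the residue ordering in `AMINO_ACIDS`, the function name, and the choice to ignore non-standard symbols (so that N counts only the 20 standard residues) are not taken from the study, and the short sequence is a made-up example.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return the 20-dimensional vector of relative frequencies n_i / N."""
    # Assumption: symbols outside the 20 standard residues are ignored.
    counts = Counter(c for c in sequence.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acid residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Hypothetical short sequence used only to exercise the function.
features = amino_acid_composition("MKKSEEPPRS")
lysine_fraction = features[AMINO_ACIDS.index("K")]
```

Each protein then becomes one 20-dimensional row of the feature matrix passed to the classifier, with the entries of a row summing to one.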

Results
Linear and kernel SVM are applied to the dataset. As Figure 3 shows, the linear SVM makes some misclassifications, whereas the kernel SVM classifier gives a better result, as shown in Figure 4. Tables 1 and 2 likewise show that linear SVM performs worse than kernel SVM. According to Tables 3 and 4, the overall prediction rates obtained by linear and kernel SVM in identifying ordered-disordered proteins are 86.54% and 94.23%, respectively. These results show that kernel SVM gives the better classification of ordered-disordered protein sequences, because its mathematical formulation handles both linear and nonlinear data structures. Consequently, kernel SVM is a robust method for identifying protein structures as ordered or disordered.
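For reference, an overall prediction rate of this kind is usually computed as the share of correctly classified sequences in the validation set. The sketch below assumes that definition; the confusion-matrix counts are hypothetical and are not the values from the study's tables.

```python
def overall_accuracy(tp, tn, fp, fn):
    """Overall accuracy = correctly classified / all classified samples."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total

# Hypothetical counts for illustration only (not taken from Tables 1 to 4):
# tp/fn refer to disordered proteins, tn/fp to ordered proteins.
example_accuracy = overall_accuracy(tp=20, tn=22, fp=3, fn=5)
example_percent = round(100 * example_accuracy, 2)
```

With these made-up counts, 42 of 50 sequences are classified correctly.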

Discussion
Proteins are large molecules composed of one or more chains of amino acids, with a variety of shapes, sizes and chemical properties. Different types of protein sequences have specific biochemical functions [1]. Disordered proteins are one of the functionally important classes of proteins. In biochemistry, "active proteins" such as enzymes are known for their unique three-dimensional folded structures. Disordered proteins are markedly different: they do not have a unique three-dimensional structure. Nevertheless, although they lack a well-defined three-dimensional structure, they are considered active proteins. To date, biological activities such as cell cycle control, regulation and sensing have been reported for many disordered proteins [35][36][37][38][39][40]. Moreover, disorder is considered to give proteins more flexibility than ordered proteins have, so they can interact with more ligands [2]. Disordered structure is mostly observed in proteins implicated in cell signaling, transcription and chromatin remodeling. Therefore, classification of ordered-disordered protein sequences is essential for explaining some diseases [41,42] and hence for drug discovery studies.
Experimental classification is costly and time-consuming because of the large number of raw sequences. Therefore, computational solutions have become a useful tool for analysis [9][10][11][12]. Prediction methods must give accurate and rapid solutions. However, available methods [43][44][45][46] can assess only one protein sequence at a time as ordered or disordered, and they have remarkably slow execution times. In this study, therefore, multiple ordered and disordered protein sequences were classified at once with the linear SVM classifier and the kernel SVM classifier, with fast execution time. Kernel SVM also gives high prediction accuracy; it can therefore be inferred that it is a competitive, robust and fast classification method for identifying proteins in terms of their ordered-disordered structure.