Propensity-based classification: Dehalogenase and non-dehalogenase enzymes

The present work was designed to classify and differentiate between dehalogenase enzymes and non-dehalogenases (other hydrolases) using the amino acid propensities of the core, the surface, and both regions combined. The data sets were assembled individually by selecting 3D protein structures available in the PDB (Protein Data Bank). The core amino acids were predicted with the IPFP tool, and their structural propensities were calculated by an in-house software, Propensity Calculator, which is available online. All data sets were finally grouped into two categories, dehalogenase and non-dehalogenase, using the Naïve Bayes, J48, Random forest, K-means clustering, and SMO classification algorithms. On comparing the various classification methods, the tree-based method (Random forest) performed best, with a maximum classification accuracy of 98.88% for the core propensity data set. We therefore propose that core amino acid propensity can serve as a novel descriptor for the classification of enzymes.


Introduction
Microbial dehalogenases are unique enzymes, produced by microbes, that dehalogenate toxic halogenated substances by breaking C-Cl bonds, which makes them a biotechnologically important enzyme group (Arand et al, 1994; Kovalchuk & d'Itri, 2004). In general, these enzymes are classified as hydrolases along with other hydrolytic enzymes that catalyse hydrolytic cleavage of C-N, C-P, and C-O bonds (Koonin & Tatusov, 1994). The mechanism of bond cleavage is quite similar across all hydrolases regardless of the atoms involved. Apart from classical enzyme classification techniques, data mining methods are helpful for analyzing large sets of sequences and for information retrieval for these enzymes (Borro et al, 2006; Nasibov & Kandemir-Cavas, 2009). Various classification methods are applied in different areas, as no single classifier is best for all kinds of data. For data mining problems, especially in the case of proteins, classifiers can achieve high accuracy by considering suitable features such as enzyme physicochemical and structural properties (Banerjee et al, 2010; King et al, 2001). To address this enzyme grouping problem, many data mining approaches have been implemented. The most common method is clustering enzymes based on their sequence and structural similarity (Fayech et al, 2009). However, these approaches sometimes fail, especially in the case of proteins (enzymes), many of which perform the same function in spite of dissimilarity in their sequence and/or structure. Another significant task for researchers in bioinformatics is to classify proteins into families based on their structural and functional properties, thereby predicting the functions of new protein sequences (Krishna et al, 2003). Over the past few years, new computational methods as well as novel protein features have been developed and implemented to expand our knowledge of protein classification. Some of them are global and local
structural alignment algorithms that trace conformational similarities between proteins, indicating functional similarities (Holm & Sander, 1993; May & Johnson, 1994; Taylor, 1999). More practically, computational methods that utilize three-dimensional (3D) structures of proteins are more efficient than sequence-based function prediction, because protein structures are more conserved than sequences during evolution. Various categories of structural information, such as folding patterns and the amino acids forming active sites along with their conformations and interaction patterns with ligands, have been used for data mining purposes (Ivanciuc et al, 2002; Oldfield, 2002). The availability of high-resolution structural data for target proteins or their homologs, however, remains the major limitation of this methodology. Accurate and efficient classification requires a robust data mining strategy and a specific data set that ultimately classifies and improves predictions for unclassified data. Several typical classification techniques are available in the literature, such as Decision Trees, Naïve Bayesian methods, Sequential Minimal Optimization (SMO), etc.
(Delen et al, 2005; Ramesh & Ramar, 2011; Wisaeng, 2013; Wei X et al, 2014). For various data mining purposes, Weka is widely used software that integrates several data mining features such as data pre-processing tools, learning algorithms, and performance evaluation methods. Additionally, its graphical user interface (GUI) provides an excellent environment for inspecting classification details (Amini et al, 2013; Frank et al, 2004). The primary goal of this work is to classify the dehalogenase class of enzymes against other hydrolases based on structural amino acid propensities. For this purpose, a protocol was first developed to find the amino acids present in the core/surface of a protein, and their propensities were calculated. Various classification methods were then employed to separate the two groups of enzymes, and the different classifiers were examined with the Weka tool by analyzing different parameters to determine which algorithms perform better.

Materials and methods
All computations were performed on a PC running Windows 7, with a 1.76 GHz dual-core processor, 2 GB RAM, and a 250 GB hard disk.

Data set preparation
The Protein Data Bank (PDB) is the primary repository for experimentally determined 3D protein structures. The protein structures available for dehalogenases were retrieved by querying the PDB for structures that are single-chained and shorter than 400 amino acids. The search yielded 90 protein structures determined by X-ray crystallography, of which 45 are dehalogenases and the remaining 45 are non-dehalogenases (other hydrolases) (Berman et al, 2000).

Calculation of core and surface residues and propensities
Calculations of core and surface amino acids in a given PDB file were performed using the IPFP tool, available online (Satpathy et al, 2014). This tool first computes the accessible surface area of all residues by calculating the atomic accessible surface defined by a rolling probe of given size around the van der Waals surface, as explained by Lee and Richards (Hubbard & Thornton, 1993; Lee & Richards, 1971). Here, a probe size of 1.4 Å was chosen. From the accessible surface areas, the core amino acids are then predicted: amino acids whose non-polar accessible surface area is zero are assigned to the core, and the remaining amino acids are predicted to be on the surface. The following equations are used to calculate the propensities from an individual PDB file. The propensities were calculated automatically by a Matlab script that prepares an input file for the classification; missing amino acid propensities were assigned zero in the input files. The script, in the form of a Windows executable, is freely available online. Here, the propensity was computed by providing the core amino acids and the total amino acid composition as input. The Propensity Calculator tool computes the surface-exposed propensity (SP) and core propensity (CP) as presented below in (1) and (2) and described by Reddy et al.:

SP_i = (N_Sol,i / T_Sol) / (T_i / Total) (1)

CP_i = (N_Bur,i / T_Bur) / (T_i / Total) (2)

In (1), N_Sol,i is the number of residues of type i that are solvent-exposed, T_Sol is the total number of solvent-exposed residues, T_i is the total number of residues of type i present in the protein, and Total is the total number of residues in the protein. Similarly, in (2), N_Bur,i is the number of residues of type i that are buried in the core region and T_Bur is the total number of buried residues; T_i and Total are as in (1).
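As a minimal sketch of the propensity definition in (1) and (2) — the fraction of a residue type in a region divided by its fraction in the whole protein — the computation can be written as follows. The function name and toy sequences are illustrative; this is not the Propensity Calculator tool itself:

```python
from collections import Counter

def propensity(region_residues, all_residues):
    """Propensity of each residue type in a region (core or surface):
    (count in region / region size) / (count in protein / protein size).
    Residue types absent from the region get a propensity of zero."""
    region_counts = Counter(region_residues)
    total_counts = Counter(all_residues)
    t_region = len(region_residues)   # T_Sol or T_Bur in the equations
    total = len(all_residues)         # Total in the equations
    return {
        aa: (region_counts.get(aa, 0) / t_region) / (total_counts[aa] / total)
        for aa in total_counts
    }

# Toy example: a 10-residue "protein" with a 4-residue core
protein = list("AALLGGKKRR")
core = list("AALL")
cp = propensity(core, protein)
# cp["A"] = (2/4) / (2/10) = 2.5; cp["G"] = 0.0 (G is absent from the core)
```

In the real data set, residue types with zero propensity correspond to the zero-filled missing values mentioned above.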

Data mining approaches by utilizing propensity feature
The entire set of computed propensity data was divided into three parts: core alone, surface alone, and a combination of the two as the complete data set.

Naïve Bayes
The Naïve Bayes classifier is a probabilistic model that assumes all attributes are conditionally independent. The main advantage of this classifier is that, being probabilistic, it can perform well even in the presence of noise and missing values in the data, and also when the sample size is small (De Ferrari & Aitken, 2006).
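As an illustrative sketch (the paper itself uses Weka, not scikit-learn), a Gaussian Naïve Bayes classifier can be trained on a 90x20 propensity-style matrix. The data below is synthetic and merely stands in for the real core-propensity values:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Synthetic stand-in for the 90x20 core-propensity matrix:
# two classes of 45 proteins with slightly shifted propensity means.
X = np.vstack([rng.normal(1.2, 0.3, (45, 20)),
               rng.normal(0.8, 0.3, (45, 20))])
y = np.array([1] * 45 + [0] * 45)   # 1 = dehalogenase, 0 = other hydrolase

clf = GaussianNB().fit(X, y)        # independent Gaussian per attribute
acc = clf.score(X, y)               # training-set accuracy
```

The conditional-independence assumption is what lets the model estimate one mean and variance per attribute per class instead of a full joint distribution.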

J48
The J48 decision tree classifier builds a decision tree from the attribute values in the available training data. A decision tree is a flow-chart-like tree structure in which the topmost node is called the root node, each internal node denotes a test on an attribute, each branch represents an outcome of the test, and every terminal node (leaf node) holds a class label (Kotsiantis 2007). J48 uses a divide-and-conquer algorithm to split the root node into subsets recursively until leaf nodes (target nodes) occur in the tree. Given a set T of training instances, the following steps are used to construct the tree:
1. Select a test based on a single attribute with two or more possible outcomes.
2. Make this test the root node of the tree, with one branch for each outcome of the test.
3. Partition T into T1, T2, T3, ..., Tn according to the outcome for each case, and apply the same procedure recursively to each sub-node.
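The steps above can be sketched with scikit-learn's DecisionTreeClassifier. Note that this implements CART rather than the C4.5 algorithm behind Weka's J48, so it is a close analogue, not the identical method; the data is again a synthetic stand-in:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Synthetic stand-in for the 90x20 propensity matrix (45 proteins per class)
X = np.vstack([rng.normal(1.2, 0.3, (45, 20)),
               rng.normal(0.8, 0.3, (45, 20))])
y = np.array([1] * 45 + [0] * 45)

# Recursively splits on the single best attribute at each node
# (step 1), branching on the test outcome (step 2) until pure leaves
# are reached (step 3).
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
acc = tree.score(X, y)   # an unpruned tree fits the training set exactly
```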

Random forest
Random Forest (RF) is a classification method based on the aggregation of a large number of decision trees. More precisely, it is an ensemble of decision trees constructed from a training data set and validated to generate predictions of the response from the given predictors for future observations. The basic steps of the algorithm are: sample the data → train on the samples → select features → split the data by the best predictor (grow trees) → estimate the error → random forest (collection of all trees). This method combines tree predictors such that each tree depends on the values of a random vector sampled with the same distribution for all trees in the forest. The generalization error of the forest converges to a limit as the number of trees becomes large. The algorithm works iteratively until the specified number of trees is obtained (Yao et al, 2013).
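The sample → grow trees → estimate error pipeline above maps directly onto scikit-learn's RandomForestClassifier, where the out-of-bag (OOB) score plays the role of the error estimate. This is a sketch on synthetic stand-in data, not the Weka run reported in the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
# Synthetic stand-in for the 90x20 propensity matrix
X = np.vstack([rng.normal(1.2, 0.3, (45, 20)),
               rng.normal(0.8, 0.3, (45, 20))])
y = np.array([1] * 45 + [0] * 45)

# Each tree is grown on a bootstrap sample with a random feature
# subset per split; oob_score_ estimates the generalization error
# from the samples each tree did not see.
rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            random_state=0).fit(X, y)
oob = rf.oob_score_
```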

K-means clustering
K-means is one of the simplest and oldest unsupervised learning algorithms that solve the well-known clustering problem (Jain 2010). The procedure follows a simple way to classify a given data set into a certain number of clusters (say k clusters) fixed in advance. The main idea is to define k centroids, one for each cluster. The basic steps of k-means clustering are:
1. Determine the number of clusters (k), each represented by a centroid.
2. Take k random objects as the initial centroids and assign each object to the closest centroid.
3. After all objects are assigned, re-calculate the positions of the centroids.
4. Repeat steps 2 and 3 until there is no further movement of the centroids.
5. Separate the objects into the k clusters.
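The steps above can be implemented directly (Lloyd's algorithm). This minimal sketch assumes well-separated data so that no cluster ever becomes empty during the iterations:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Steps 1-5 above: pick k random objects as initial centroids,
    assign each object to the closest centroid, recompute centroids,
    and repeat until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # step 2 (init)
    labels = None
    for _ in range(n_iter):
        # step 2: distance of every object to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and (new_labels == labels).all():
            break                                         # step 4: converged
        labels = new_labels
        # step 3: re-calculate centroid positions (assumes no empty cluster)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids                              # step 5

# Two well-separated 2D blobs of 30 points each
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.2, (30, 2)),
               rng.normal(3.0, 0.2, (30, 2))])
labels, centroids = kmeans(X, 2)
```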

SMO
Sequential Minimal Optimization (SMO) is an algorithm for training Support Vector Machines (SVMs). An SVM is a learning machine for two-group classification problems that transforms the attribute space into a multidimensional feature space using a kernel function, so that data set instances can be separated by an optimal hyperplane. As SVM accuracy depends strongly on the selection of attributes, proper attribute selection in a data set increases the performance of the SMO algorithm (KR 2011). The SMO algorithm works iteratively, solving the optimization problem by breaking it into a series of smallest-possible sub-problems that are solved analytically.
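scikit-learn's SVC is built on libsvm, whose solver is an SMO-type decomposition method, so it can serve as a stand-in sketch for Weka's SMO classifier. The data is again synthetic, not the paper's propensity matrix:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
# Synthetic stand-in for the 90x20 propensity matrix
X = np.vstack([rng.normal(1.2, 0.3, (45, 20)),
               rng.normal(0.8, 0.3, (45, 20))])
y = np.array([1] * 45 + [0] * 45)

# libsvm solves the SVM dual problem with an SMO-type algorithm,
# optimizing two Lagrange multipliers analytically at each step.
svm = SVC(kernel="linear").fit(X, y)
acc = svm.score(X, y)
```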

Performance evaluation for the classifiers
The correct classifications were evaluated comparatively. The performance measures are based on true positives (TP, correctly identified), false positives (FP, incorrectly identified), true negatives (TN, correctly rejected), and false negatives (FN, incorrectly rejected). The best classification methods obtained were re-evaluated by the following performance analysis (Table 4, Table 5, and Table 6).

True positive rate (TPR)
The true positive rate (TPR) is the probability of correctly predicting the positives: TPR = TP / (TP + FN).

False positive rate (FPR)
The false positive rate (FPR) is the probability of incorrectly predicting the negatives: FPR = FP / (FP + TN). (5)

Precision
Precision is the ratio of instances correctly classified as positive to the total number of instances classified as positive: Precision = TP / (TP + FP).

F-measure (FM)
FM is a combination of recall and precision, defined as their harmonic mean: FM = 2 × Precision × Recall / (Precision + Recall), where recall measures the proportion of actual positives that are correctly identified (Recall = TP / (TP + FN)).

ROC area
The ROC (Receiver Operating Characteristic) curve is a tool for comparing different data models. It measures the impact of changes in the probability threshold and forecasts the percentage of correct classifications. The value of the ROC area lies between 0 and 1. The probability threshold is the decision point used by the model for categorization; for two-class classification, the default probability threshold is 0.5. When the predicted probability of a class is 50% or more, the model predicts that class and the result is counted in the true positive region.
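The measures above can be collected into one small helper computed from the confusion-matrix counts. The function name and example counts are illustrative, not values from the paper:

```python
def classification_metrics(tp, fp, tn, fn):
    """TPR, FPR, precision, and F-measure from confusion-matrix counts,
    following the definitions given above."""
    tpr = tp / (tp + fn)                 # recall / sensitivity
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    fm = 2 * precision * tpr / (precision + tpr)   # harmonic mean
    return {"TPR": tpr, "FPR": fpr, "Precision": precision, "F-measure": fm}

# Hypothetical counts for a 90-protein evaluation (45 positives, 45 negatives)
m = classification_metrics(tp=40, fp=2, tn=43, fn=5)
```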

Results and discussions
In the preliminary approach to classifying dehalogenase and other hydrolase enzymes, all 90 protein structures (45 per group) were selected from the Protein Data Bank (PDB). All of them belong to the hydrolase class; however, dehalogenases cleave the C-Cl bond while other hydrolases cleave C-N, C-P, ester, and other bonds. The PDB IDs considered are given in Table 1. For every protein structure in a group, we calculated the core and surface residues and then determined the propensities as explained in the Materials and methods section. A residue not present in the core or surface is assigned zero. The core and surface data sets each contain 20 attributes (one per amino acid) and 90 rows (90x20), one row per PDB entry. Similarly, a 90x40 pattern file was generated for the total data set containing both surface and core propensities for each PDB entry. In this way, we used a one-dimensional representation of 3D protein structures, based on calculated regional propensity properties, to train different classification algorithms for automatic enzyme classification. All these data were supplied to each classifier separately as training data. Each classifier grouped the enzyme propensity data sets into two classes, dehalogenase and non-dehalogenase. Across all three categories of data sets, the Random forest algorithm was found most suitable in both classification accuracy and execution time (Table 2, Table 3 and Table 4).
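The evaluation workflow described above can be sketched as a cross-validated comparison of classifiers. The paper's experiments were run in Weka on the real 90x20 propensity matrix; this scikit-learn version uses synthetic stand-in data, so the accuracies it produces are illustrative only:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
# Synthetic stand-in for the 90x20 core-propensity matrix
X = np.vstack([rng.normal(1.2, 0.3, (45, 20)),
               rng.normal(0.8, 0.3, (45, 20))])
y = np.array([1] * 45 + [0] * 45)   # 1 = dehalogenase, 0 = other hydrolase

results = {}
for name, clf in [("Naive Bayes", GaussianNB()),
                  ("Random forest", RandomForestClassifier(n_estimators=100,
                                                           random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    results[name] = scores.mean()
```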

Conclusions
Both core and surface residues are responsible for many features of a protein, such as substrate binding, thermostability, protein folding, and several other functions. The core amino acids of a protein are largely conserved during evolution. Moreover, during protein folding, the specific arrangement of these residues forms a 'topological pattern' that has functional implications for the protein (enzyme). Hence, the amino acid composition of the core is an important feature for protein classification. Analyzing the derived results, we conclude that the accuracy of Random forest is the best among the algorithms considered. We also infer that the quantitative propensity feature of the core region of a protein can be used as an excellent, novel descriptor for the classification of enzymes. In the future, we aim to apply this feature to classify other proteins/enzymes.

Figure 1. Schematic representation of the steps followed in this work.