Integration of Multi-Feature Fusion and PLS-DA in Protein Secondary Structure Prediction

Protein structure prediction has become one of the central problems in modern computational biology, and secondary structure prediction is the basis for predicting the spatial structure of proteins. This paper presents a novel method for protein secondary structure prediction that integrates multi-feature fusion with partial least squares discriminant analysis (PLS-DA). Multi-feature fusion makes full use of the available protein information, but it also produces high-dimensional, redundant features. PLS-DA is therefore applied to the fused protein data to extract features and remove the redundant information. Several benchmark datasets are used to verify the performance of the proposed method. The experimental results show that the proposed method gives satisfactory predictions of protein secondary structure compared with existing methods. The integration of multi-feature fusion and PLS-DA can therefore fully utilize the available protein information, effectively reduce dimensionality, and achieve robust classification in the multi-category analysis of protein secondary structure.


Introduction
Proteins are important biological macromolecules that play vital roles in almost all biological phenomena. The function of a protein depends on its spatial structure, which has strongly motivated the study of protein structure [1]. Methods for determining protein structure fall into two categories. The first is experimental, including X-ray diffraction and nuclear magnetic resonance (NMR). The second is computational, including homology modelling [2], fold recognition [3] and ab initio prediction [4]. Although experimental methods are more accurate, they are time-consuming and constrained by the available technology and equipment. Computational methods can overcome these limitations and therefore have great potential for development. In bioinformatics, protein secondary structure is the basis for predicting protein tertiary structure, and predicting it is the first step in many protein studies. Based on their secondary structure content, proteins are classified into four structural classes: all-α, all-β, α/β and α+β. At present, the application of machine learning to protein secondary structure prediction is one of the most active research areas in bioinformatics.
The feature representation of a protein has a large impact on secondary structure prediction. Commonly adopted representations include amino acid composition (AAC) [5], polypeptide composition, functional domain composition [6], physicochemical features, PSI-BLAST profiles [7] and function annotation information [8]. Making full use of the available protein feature information can improve the prediction of protein structure. To this end, this paper focuses on integrating multiple sources of protein information, namely PSI-BLAST profiles, PROFEAT features and Gene Ontology annotations, as the feature representation of a protein. However, this fusion gives the protein data far more features than samples, which makes the data difficult to process with traditional discriminant analysis. Partial least squares discriminant analysis (PLS-DA) [9], proposed by Eriksson et al., is a robust discriminant analysis method. It is particularly suitable for data with multicollinearity among the explanatory variables, high dimensionality and small sample size, which is exactly the case for protein data after multi-feature fusion. In this paper, we study the performance of integrating multi-feature fusion with PLS-DA for the prediction of protein secondary structure.

Generating feature vector of protein by multi-feature fusion
In classification, all the information that decides which class a sample falls into is contained in the sample's feature vector. How the protein feature vector is constructed therefore has a significant influence on the resulting classification model. To make full use of the available protein feature information, the protein feature vector in this paper fuses three kinds of commonly used protein information, which are described below.

Linear predictive coding of the PSSM
The PSSM (Position Specific Scoring Matrix) [10] of a protein sequence represents the homology information carried by its aligned sequences, and it is an effective feature for predicting protein structure and function. We used the PSI-BLAST program to search NCBI's non-redundant (NR) database with the parameter settings h = 0.001 and j = 3 to obtain the PSSM of each protein sequence. The PSSM elements were scaled into the range [0, 1] by the standard sigmoid function

$$f(x) = \frac{1}{1 + e^{-x}}$$

where $x$ is an original PSSM element. The linear predictive coding (LPC) scheme was then used to turn the PSSM into a fixed-length feature vector.
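The two steps above (sigmoid scaling, then LPC) can be illustrated with a minimal NumPy sketch. The paper does not state the LPC order or how the coefficients are estimated, so the autocorrelation (Yule-Walker) formulation, the order of 4, and the function names `sigmoid_scale`, `lpc_coefficients` and `pssm_to_vector` are all assumptions made for illustration. The random matrix only stands in for a real PSSM.

```python
import numpy as np

def sigmoid_scale(pssm):
    """Map raw PSSM scores into (0, 1) with the standard sigmoid."""
    return 1.0 / (1.0 + np.exp(-np.asarray(pssm, dtype=float)))

def lpc_coefficients(signal, order):
    """LPC coefficients from the autocorrelation (Yule-Walker) equations.

    Assumption: the paper does not say how its LPC coefficients are estimated.
    """
    s = np.asarray(signal, dtype=float)
    # r[k] = sum_t s[t] * s[t + k] for k = 0..order
    r = np.array([np.dot(s[: len(s) - k], s[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    # lstsq still returns a solution if R happens to be singular
    coeffs, *_ = np.linalg.lstsq(R, r[1:], rcond=None)
    return coeffs

def pssm_to_vector(pssm, order=4):
    """Turn an L x 20 PSSM into a fixed-length vector of 20 * order values."""
    scaled = sigmoid_scale(pssm)
    return np.concatenate(
        [lpc_coefficients(scaled[:, j], order) for j in range(scaled.shape[1])]
    )

# Toy stand-in for a PSSM: 50 residues x 20 amino acids (hypothetical data).
rng = np.random.default_rng(0)
toy_pssm = rng.integers(-10, 11, size=(50, 20))
vec = pssm_to_vector(toy_pssm, order=4)
print(vec.shape)  # (80,)
```

Whatever the sequence length, each of the 20 PSSM columns contributes the same number of coefficients, which is what makes the resulting vector fixed-length.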

Gene function annotation features
Gene Ontology (GO) is widely used in bioinformatics to describe the functions of genes and proteins. The GO database used in this paper was downloaded from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz (released on Nov 26, 2014). We searched the GO database for the GO terms of each protein sequence, and all the GO terms related to the entire dataset were then used to form a feature set. Each protein is represented by the vector $(g_1, g_2, \ldots, g_n)$ with

$$g_i = \begin{cases} 1, & \text{if the GO terms of the protein hit the } i\text{-th term} \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

where $g_i$ corresponds to the $i$-th GO term and $n$ is the number of GO terms of the dataset.
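As an illustration of this binary encoding, the following sketch builds the GO vocabulary of a dataset and the 0/1 vector for each protein. The protein identifiers and their annotations are invented toy data, and the helper names are hypothetical; parsing of the actual GOA file is not shown.

```python
def build_go_vocabulary(annotations):
    """Collect every GO term seen in the dataset, in a fixed (sorted) order."""
    return sorted({term for terms in annotations.values() for term in terms})

def go_feature_vector(protein_terms, vocabulary):
    """g_i = 1 if the protein is annotated with the i-th GO term, else 0."""
    hits = set(protein_terms)
    return [1 if term in hits else 0 for term in vocabulary]

# Hypothetical annotations for three proteins (toy data only).
annotations = {
    "P1": ["GO:0005515", "GO:0003677"],
    "P2": ["GO:0005515"],
    "P3": [],  # a protein with no GO hits gets an all-zero vector
}
vocab = build_go_vocabulary(annotations)
features = {pid: go_feature_vector(terms, vocab) for pid, terms in annotations.items()}
print(vocab)     # ['GO:0003677', 'GO:0005515']
print(features)  # {'P1': [1, 1], 'P2': [0, 1], 'P3': [0, 0]}
```

The vector length equals the number of distinct GO terms in the dataset, so it grows with the dataset and is typically sparse.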

Extracting structural and physicochemical features from PROFEAT
Structural and physicochemical features derived from sequence have frequently been used to develop statistical learning models for predicting proteins and peptides with different structural, functional and interaction profiles. PROFEAT is a web server that computes commonly used structural and physicochemical features of proteins and peptides from their amino acid sequences [11]. In this paper, we obtained a 1497-dimensional PROFEAT feature vector for each protein sequence.
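The paper does not spell out how the three feature blocks are fused. A minimal sketch, assuming fusion is plain concatenation of the per-protein vectors, is given below. Only the PROFEAT length of 1497 comes from the text; the other two lengths and the function name `fuse_features` are hypothetical.

```python
import numpy as np

def fuse_features(pssm_lpc, go_bits, profeat):
    """Concatenate the three per-protein feature blocks into one vector.

    Assumption: fusion is simple concatenation; the paper does not specify it.
    """
    blocks = [np.asarray(b, dtype=float) for b in (pssm_lpc, go_bits, profeat)]
    return np.concatenate(blocks)

# Placeholder blocks for one protein (lengths 80 and 5 are hypothetical).
pssm_lpc = np.zeros(80)               # LPC-coded PSSM features
go_bits = np.array([1, 0, 1, 0, 0])   # binary GO annotation features
profeat = np.zeros(1497)              # PROFEAT features (length from the paper)

fused = fuse_features(pssm_lpc, go_bits, profeat)
print(fused.shape)  # (1582,)
```

Even with these small placeholder blocks the fused vector has well over a thousand dimensions, which illustrates why the number of features can easily exceed the number of proteins in a benchmark dataset.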

Partial least squares discriminant analysis
The partial least squares (PLS) algorithm [12] has been a widely used modeling method in chemometrics and many other fields in recent years [13]. PLS establishes the relationship between two data blocks and eliminates redundant information, thereby achieving dimension reduction. The PLS regression model has many advantages: it is simple and robust, requires little computation, does not require removing any explanatory variables, and is easy to interpret. PLS-DA is based on the PLS regression algorithm [14].

In PLS-DA, the class labels of the samples need to be binary encoded. Let $X_{n \times m}$ be the feature matrix and $Y_{n \times g}$ the class matrix, where $n$ is the number of samples, $m$ is the number of features and $g$ is the number of classes. The element $y_{ij}$ of $Y_{n \times g}$ represents the relationship between the $i$-th sample $x_i$ in $X_{n \times m}$ and the $j$-th class, which is expressed as

$$y_{ij} = \begin{cases} 1, & \text{if } x_i \text{ belongs to the } j\text{-th class} \\ 0, & \text{otherwise} \end{cases}$$

A PLS regression model is then built between $X$ and $Y$ in the usual way. The PLS algorithm decomposes $X$ and $Y$ as follows:

$$X = TP' + E = \sum_{i=1}^{a} t_i p_i' + E$$

where $T$ is the score matrix of $X$, $t_i$ is the score vector, $P$ is the loading matrix, $p_i$ is the loading vector, $E$ is the residual matrix and $a$ is the number of latent variables;

$$Y = UQ' + F = \sum_{i=1}^{a} u_i q_i' + F$$

where $U$ is the score matrix of $Y$, $u_i$ is the score vector, $Q$ is the loading matrix, $q_i$ is the loading vector and $F$ is the residual matrix.

The PLS regression algorithm extracts latent variables from $X$ and $Y$ such that (i) each group of latent variables captures as much of the variation in its corresponding matrix as possible, and (ii) the covariance between the two groups of latent variables is maximized. The general regression equation can be written as

$$\hat{Y} = XB, \qquad B = W (P'W)^{-1} Q'$$

where $B$ is the matrix of regression coefficients and $W$ is the weight matrix of the PLS regression algorithm.
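To make the procedure concrete, here is a minimal NumPy sketch of PLS-DA: binary encoding of the labels, a NIPALS-style PLS2 fit, and assignment of each sample to the class with the largest predicted response. It illustrates the general technique and is not the authors' implementation. Mean-centring of both blocks, the choice of three latent variables, the function names and the synthetic data are all assumptions.

```python
import numpy as np

def one_hot(labels, classes):
    """Binary-encode class labels: y_ij = 1 if sample i is in class j, else 0."""
    return np.array([[1.0 if lab == c else 0.0 for c in classes] for lab in labels])

def pls_fit(X, Y, n_components, tol=1e-10, max_iter=500):
    """NIPALS PLS2 on mean-centred data; returns B = W (P'W)^-1 Q' and the means."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    E, F = X - x_mean, Y - y_mean            # residuals, deflated per component
    W, P, Q = [], [], []
    for _ in range(n_components):
        # Start the Y score from the highest-variance column of F.
        u = F[:, [int(np.argmax(F.var(axis=0)))]]
        for _ in range(max_iter):
            w = E.T @ u
            w /= np.linalg.norm(w)           # X weight vector
            t = E @ w                        # X score vector
            q = F.T @ t / (t.T @ t)          # Y loading vector
            u_new = F @ q / (q.T @ q)        # updated Y score vector
            converged = np.linalg.norm(u_new - u) < tol
            u = u_new
            if converged:
                break
        p = E.T @ t / (t.T @ t)              # X loading vector
        E, F = E - t @ p.T, F - t @ q.T      # deflate both blocks
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.hstack(W), np.hstack(P), np.hstack(Q)
    B = W @ np.linalg.inv(P.T @ W) @ Q.T
    return B, x_mean, y_mean

def pls_da_predict(X_new, B, x_mean, y_mean, classes):
    """Predict real-valued responses, then pick the class with the largest one."""
    Y_hat = (np.asarray(X_new, float) - x_mean) @ B + y_mean
    return [classes[j] for j in np.argmax(Y_hat, axis=1)]

# Synthetic data: 4 classes, 8 samples each, 50 features (more features than samples).
rng = np.random.default_rng(0)
classes = ["all-alpha", "all-beta", "alpha/beta", "alpha+beta"]
centers = rng.normal(0.0, 3.0, size=(len(classes), 50))
labels = [c for c in classes for _ in range(8)]
X = np.vstack([centers[classes.index(lab)] + rng.normal(0.0, 0.5, size=50)
               for lab in labels])

B, x_mean, y_mean = pls_fit(X, one_hot(labels, classes), n_components=3)
pred = pls_da_predict(X, B, x_mean, y_mean, classes)
train_acc = float(np.mean([p == t for p, t in zip(pred, labels)]))
print(B.shape, train_acc)
```

The synthetic classes are well separated, so this only checks that the pieces fit together. In practice the number of latent variables would be chosen by cross-validation rather than fixed in advance.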
For an unknown sample $x_{un}$, the corresponding predicted value can be calculated as

$$y_{un} = x_{un} B$$

where $B$ is the matrix of PLS regression coefficients. The predicted value $y_{un}$ is a real-valued vector and needs to be translated into a class label: if the maximum value of $y_{un}$ is in the $j$-th column, the sample is assigned to the $j$-th class.

Two groups of datasets were adopted to evaluate the proposed method. One group consists of the high-similarity datasets Z277 and Z498. The other consists of the low-similarity datasets 1189 and 25PDB, in which the homology between proteins is lower than 40%. Details of these datasets are listed in Table 1.

Evaluation method
The k-fold cross-validation is used to evaluate the effectiveness of the proposed method. The per-class accuracy and the overall accuracy (OA) are used as the evaluation indices, formulated as follows:

$$\text{Accuracy}_i = \frac{TP_i}{m_i}, \qquad \text{OA} = \frac{\sum_i TP_i}{N}$$

where $TP_i$ denotes the number of true positives among the proteins in class $i$, $m_i$ denotes the number of proteins in class $i$, and $N$ denotes the total number of proteins.
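The two indices can be computed directly from the true and predicted class labels. The sketch below uses invented labels for six proteins purely to show the arithmetic, and the function names are hypothetical.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Accuracy_i = TP_i / m_i for every class i present in y_true."""
    totals = Counter(y_true)                                      # m_i
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)   # TP_i
    return {c: hits[c] / totals[c] for c in totals}

def overall_accuracy(y_true, y_pred):
    """OA = (sum of TP_i over all classes) / N."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented labels for six proteins (toy data only).
y_true = ["all-a", "all-a", "all-b", "a/b", "a+b", "a+b"]
y_pred = ["all-a", "all-b", "all-b", "a/b", "a/b", "a+b"]

acc = per_class_accuracy(y_true, y_pred)
oa = overall_accuracy(y_true, y_pred)
print(acc)  # {'all-a': 0.5, 'all-b': 1.0, 'a/b': 1.0, 'a+b': 0.5}
print(oa)   # 4 correct out of 6
```

In a k-fold setting these functions would be applied to the predictions pooled over the held-out folds, or computed per fold and averaged. The paper does not say which convention it follows.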

Experimental results and discussion
After multi-feature fusion, the protein datasets share some common characteristics: a large data volume, far fewer samples than feature dimensions, and contamination by noise. The discriminant analysis method employed therefore needs to filter out noise, extract features and build a model for protein secondary structure prediction, a task to which PLS-DA is well suited. Tables 2 to 5 show that the results given by the integration of multi-feature fusion and PLS-DA are comparable to those of existing methods. Another observation is that protein datasets with low homology are harder to recognize than those with high homology, especially for the α+β class. This is because PLS-DA is a linear discriminant analysis model and is therefore not well suited to nonlinear classification problems.
PLS-DA can also be used as a dimension reduction method. Figure 1 shows that the features extracted by PLS-DA significantly increase the similarity between protein samples within the same class and the difference between protein samples from different classes. The features extracted by PLS-DA can therefore be combined with a nonlinear classifier, such as an SVM, to improve the recognition of low-homology protein datasets. Figure 2 shows that PLS-DA is also an effective visualization technique for protein data. PLS-DA can thus combine classification, dimension reduction and data visualization in a single method for protein secondary structure prediction, and multi-feature fusion further strengthens these advantages.

Conclusion
Multi-feature fusion makes full use of the available protein information, and PLS-DA effectively extracts features and removes redundant information. The integration of multi-feature fusion and PLS-DA can therefore deal effectively with the problem of protein secondary structure prediction. However, PLS-DA is a linear discriminant analysis model and is therefore not well suited to nonlinear classification problems, such as recognizing the α+β class in low-similarity datasets. Even so, PLS-DA can still be used as a dimension reduction method; when it is combined with a conventional nonlinear classifier, a more powerful nonlinear classifier can be constructed.

Table 1. Datasets used to verify the effectiveness of the proposed method.

Table 2. Performance comparison of different methods on the Z277 dataset.

Table 3. Performance comparison of different methods on the Z498 dataset.

Table 4. Performance comparison of different methods on the 25PDB dataset.

Table 5. Performance comparison of different methods on the 1189 dataset.