A Data Mining Method For Improving the Prediction Of Bioinformatics Data

The composition of a protein is closely correlated with its function. It is therefore important to develop a method that can automatically predict protein structure. The fusion encoding method of PseAA and DC is adopted to describe the protein features. Expressing protein sequences with this encoding produces high-dimensional feature vectors, so this paper applies an algorithm that reduces the feature dimension of proteins: by extracting the significant features from the original feature vectors, high-dimensional feature vectors are mapped to low-dimensional ones. The jackknife test is adopted as the experimental method. The results indicate that the proposed algorithm is appropriate for identifying whether a given protein is a homo-oligomer or a hetero-oligomer.

Sequence information is often used in protein encoding algorithms, and such algorithms are readily available. The fusion encoding method of PseAA and DC [1,2] is adopted here to describe the protein features. Expressing protein sequences with this encoding produces high-dimensional feature vectors, so this paper applies a dimensionality reduction algorithm, MDM-Isomap [3][4][5]. By extracting the significant features from the original feature vectors, high-dimensional feature vectors are mapped to low-dimensional ones. The jackknife test is adopted as the experimental method and KNN as the classification algorithm. The results indicate that the proposed algorithm is appropriate for identifying whether a given protein is a homo-oligomer or a hetero-oligomer.

Method
A suitable representation of a protein P is required. The fusion encoding algorithm of PseAA and DC [1,2] is adopted to describe the protein features. DC and PseAA are introduced as follows.

Dipeptide Composition
DC [1,2] is an encoding algorithm in which a protein is represented by a 400-D feature vector, one component for each ordered pair of the 20 amino acids. The components are computed by Eq. 1. (1)
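As a minimal sketch of the dipeptide encoding, the Python function below counts overlapping residue pairs and normalizes by the number of counted pairs. Since Eq. 1 is not reproduced above, this normalization is an assumption based on the usual definition of dipeptide composition, and the names `AMINO_ACIDS`, `DIPEPTIDES`, and `dipeptide_composition` are illustrative only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def dipeptide_composition(sequence):
    """Return a 400-D list of dipeptide frequencies for one protein sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    # Slide a window of width 2 over the sequence (overlapping pairs).
    for a, b in zip(sequence, sequence[1:]):
        pair = a + b
        if pair in counts:  # assumption: pairs with non-standard residues are skipped
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    # Assumed normalization: divide by the number of counted pairs.
    return [counts[d] / total for d in DIPEPTIDES]
```

For the toy sequence "ACAC" the counted pairs are AC, CA, AC, so the AC component is 2/3 and the CA component is 1/3.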

PseAA Composition
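PseAA is an encoding method. The protein is represented as

$$P = [p_1, p_2, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda}]^{T} \tag{2}$$

where the $20+\lambda$ components are generated by

$$p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 20 \\[2ex] \dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 20+1 \le u \le 20+\lambda \end{cases} \tag{3}$$

The weight $w$ is 0.05. $\theta_j$ is the $j$-th tier correlation factor, and $f_i$ is the occurrence frequency of the $i$-th of the 20 amino acids in the sequence. The minimum protein sequence length in the dataset is 39. In Eqs. 2 and 3, $\lambda$ is 24, so a 20+24=44-D vector is produced. Fusing the PseAA and DC encodings to describe the protein features yields a 400+44=444-D vector. An algorithm that reduces this feature dimension is then applied; MDM-Isomap [6-8] is adopted.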
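The PseAA encoding can be sketched in Python as follows, using the weight 0.05 and the 24 correlation tiers stated in the text. The text does not list the physicochemical properties behind the tier correlation factors, so the sketch takes them as a hypothetical `properties` argument (a list of dictionaries mapping each residue to a normalized value) and assumes the commonly used mean squared difference as the coupling between residues. It also assumes the sequence contains only the 20 standard residues.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def pseaa_composition(sequence, properties, lam=24, w=0.05):
    """Return a (20 + lam)-D pseudo amino acid composition vector.

    properties: list of dicts, residue -> normalized property value
                (hypothetical input; not specified in the text).
    """
    length = len(sequence)
    if length <= lam:
        raise ValueError("sequence must be longer than lam")

    # Occurrence frequency of each of the 20 amino acids.
    freqs = [sequence.count(a) / length for a in AMINO_ACIDS]

    # j-th tier correlation factor: average coupling of residues j positions apart.
    thetas = []
    for j in range(1, lam + 1):
        total = 0.0
        for i in range(length - j):
            a, b = sequence[i], sequence[i + j]
            # Assumed coupling: mean squared property difference.
            total += sum((p[b] - p[a]) ** 2 for p in properties) / len(properties)
        thetas.append(total / (length - j))

    # Shared denominator of both branches of the component formula.
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```

With the default arguments the result has 44 components that sum to 1, which matches the 44-D PseAA vector described above. Requiring the sequence to be longer than `lam` is consistent with the stated minimum sequence length of 39.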

MDM-Isomap
The MDM-Isomap algorithm proceeds in three steps [9][10][11]: construction of a dissimilarity neighborhood graph, computation of geodesic distances on that graph, and construction of the low-dimensional embedding.

First, the dissimilarity neighborhood graph $G$ is constructed. The input points in the high-dimensional space are represented as a fully connected graph, and the Euclidean distance $d(x_i, x_j)$ is computed between every pair of input points $x_i$, $x_j$; $d(x_i, x_j)$ reflects the original dissimilarity between $x_i$ and $x_j$. To select the neighbors of each data point, the minimax distance metric $d_m(x_i, x_j)$ replaces the Euclidean distance metric $d(x_i, x_j)$; it is obtained by Eq. 11. The new dissimilarity relations are represented as a dissimilarity graph over the data points, with edges of weight $d_m(x_i, x_j)$ between input points. Under the minimax distance metric, $x_i$ and $x_j$ are neighbors if $x_j$ is one of the $k$ nearest neighbors ($k$-nn) of $x_i$, or if $d_m(x_i, x_j) < \varepsilon$, where $\varepsilon$ is a fixed threshold.

Second, the geodesic distances are computed on the dissimilarity neighborhood graph. This step is similar to the second step of the Isomap method; the difference is that the geodesic distance between any two points $x_i$ and $x_j$ is computed on the new dissimilarity neighborhood graph. The result is the matrix of shortest-path distances $D_G = \{ d_G(x_i, x_j) \}$.

Third, the low-dimensional embedding is constructed. The classical MDS method is applied to the new shortest-path distances to map the data into a low-dimensional space [12].
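The second and third steps can be sketched with NumPy as below. The minimax metric of Eq. 11 is not reproduced in the text, so the sketch does not attempt it: it accepts any precomputed symmetric dissimilarity matrix `D`, and the function name `embed_from_dissimilarity` and its parameters are illustrative assumptions. Neighbors are chosen by the $k$-nearest-neighbor rule only, geodesic distances come from a Floyd-Warshall pass, and the embedding uses classical MDS.

```python
import numpy as np


def embed_from_dissimilarity(D, n_neighbors=5, n_components=2):
    """Neighborhood graph -> geodesic distances -> classical MDS embedding."""
    D = np.asarray(D, dtype=float)
    n = len(D)

    # Neighborhood graph: keep an edge only between k-nearest neighbors.
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        order = [j for j in np.argsort(D[i], kind="stable") if j != i]
        for j in order[:n_neighbors]:
            G[i, j] = G[j, i] = D[i, j]  # symmetric edge weighted by dissimilarity

    # Geodesic distances: all-pairs shortest paths (Floyd-Warshall).
    for k in range(n):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    if np.isinf(G).any():
        raise ValueError("neighborhood graph is disconnected; increase n_neighbors")

    # Classical MDS: double-center the squared distances, then eigendecompose.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by rounding before the square root.
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
```

For ten equally spaced points on a line, a 1-D embedding from this sketch reproduces the original pairwise distances up to sign and translation. To follow the method described above, `D` would hold the minimax distances rather than the Euclidean ones.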

Experimental Results
The jackknife test is adopted as the experimental method, with KNN and SVM as the classification algorithms.
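To make the evaluation protocol concrete, the following pure-Python sketch runs a jackknife (leave-one-out) test with a KNN classifier: each sample is held out in turn and predicted from all remaining samples. Only KNN is shown. The function names, the Euclidean distance, and the value of `k` are assumptions, since the text does not give these settings.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=1):
    """Label a query vector by majority vote among its k nearest training vectors."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def jackknife_accuracy(X, y, k=1):
    """Hold out each sample once, predict it from the rest, return the accuracy."""
    correct = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]  # all feature vectors except sample i
        rest_y = y[:i] + y[i + 1:]  # the matching labels
        if knn_predict(rest_X, rest_y, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)
```

Here `X` is a list of feature vectors (for example the reduced vectors) and `y` the matching list of class labels, such as homo-oligomer or hetero-oligomer.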

MDM-Isomap & SVM, 70-D vectors
The MDM-Isomap algorithm is used to extract significant 70-D feature vectors from the original 444-D feature vectors, so that high-dimensional feature vectors are mapped to low-dimensional ones.
Table 1 compares the identification results obtained with the MDM-Isomap algorithm and without it. The accuracy of the first is about 70%, and the accuracy of the second is 80%. Table 2 gives the prediction results for each of the two kinds of proteins, again with and without the MDM-Isomap algorithm. The results indicate that the proposed algorithm is very appropriate for protein structure prediction.

Conclusions
The composition of a protein is closely correlated with its function. It is therefore very important to develop a method that can automatically predict protein structure. The fusion encoding method of PseAA and DC was adopted to describe the protein features. Expressing protein sequences with this encoding produces high-dimensional feature vectors, so this paper applies an algorithm that reduces the feature dimension. By extracting the significant features from the original feature vectors, high-dimensional feature vectors are mapped to low-dimensional ones. The experimental