Similarity / dissimilarity analysis of protein sequences using the spatial median as a descriptor

A novel 3-D graphical representation of protein sequences is introduced. A right cone with unit base radius and unit height is selected to represent protein sequences on its surface. The twenty amino acids are represented by 20 circles, and the n residues of a protein are represented by n lines on the cone’s surface. All the spots that represent the protein’s residues are visible in the cone’s top view. The spatial median of all the spots is used as a new descriptor of a protein sequence. The approach is applied to two short protein segments of the yeast Saccharomyces cerevisiae. The examination of the similarities/dissimilarities among the eight ND5 proteins and the six β-globin proteins illustrates the utility of our approach. A linear correlation and significance analysis is provided to compare our results with the percentage sequence alignment identity.


INTRODUCTION
There is a huge gap between the growth of protein sequence databases and that of structure databases. Many researchers in different areas have tried to bridge this gap. Protein structure prediction has succeeded in doing so at low cost compared with experimental methods such as NMR and X-ray crystallography. Sequence analysis plays an important role in protein structure prediction; proteins with similar sequences mostly have similar structures [1]. The usual representation of protein sequences is the alphabetic representation, also called letter sequence representation (LSR). LSR represents any protein sequence by letters corresponding to the 20 amino acids. The letters of the 20 amino acids are A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, and V. It is difficult to recognize and compare different sequences by using LSR. Therefore, many mathematical approaches have been proposed to translate protein sequences from letters to 2D or 3D graphical representations, accompanied by mathematical objects such as vectors or matrices that serve as sequence descriptors and can be compared with each other. Numerical characterization is very useful whether it depends on a graphical representation [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16] or not [17]. However, approaches in which the numerical characterization is preceded by a graphical representation are preferable to those without one, because the graphical representation allows visual inspection.
A graphical representation of protein sequences may depend either on selecting a geometrical object to represent residues or on assigning vectors to residues. An example of assigning vectors to residues in 2D is the assignment of a two-component vector to each amino acid [9]. The two components were the pKa of NH3+ and the pKa of COOH, taken as the x- and y-coordinates respectively. This was later modified by taking the y-coordinate as the difference between the pKa (COOH) of each amino acid and the average pKa (COOH) of all twenty amino acids [12]. An example of assigning vectors to residues in 3D was introduced by selecting three physicochemical properties of amino acid side chains: the hydropathy index, the side chain charge, and the mean accessible surface area (ASA) of the side chain [15]. Another 3D graphical representation of proteins is based on the five-letter model of amino acids, which converts the twenty letters of the amino acids to only five letters [13]. A vector of three components was then assigned to each letter in the reduced sequence.
A square [3,4] and a circle [6] are examples of 2D geometrical objects, while a tetrahedron [4] and a sphere [16] are examples of 3D geometrical objects. Recently, a unit radius sphere was selected to represent any protein sequence on its surface [16]. The sphere's surface was divided into 20 latitude-like circles and n longitude-like semi-circles, where n is the protein sequence length. The resulting 3D graphical representation was obtained by a direct assignment of the 20 amino acids, without depending on a prior graphical representation of the corresponding RNA sequence.
In this paper, a right cone with a unit base radius and unit height has been chosen to represent any protein sequence on its surface. The idea of our approach is illustrated by applying it to two short protein segments of the yeast Saccharomyces cerevisiae. The approach is then applied to the eight ND5 proteins and to the six β-globin proteins, and the similarity/dissimilarity is measured for each set. Our approach is compared with the percentage sequence alignment identity (PID%) through a linear correlation and significance analysis.

3-D GRAPHICAL REPRESENTATION OF PROTEIN PRIMARY SEQUENCES
A geometric cone has been selected to represent any protein sequence on its surface. A right cone of height h and base radius r is oriented along the z-axis, with the vertex pointing up and the base located at z = 0, as shown in Figure 1. The right cone can be described by the following parametric equations:

$$x = \frac{h-u}{h}\, r\cos\theta, \qquad y = \frac{h-u}{h}\, r\sin\theta, \qquad z = u, \tag{1}$$

for $0 \le u \le h$ and $0 \le \theta \le 2\pi$.
We have set both the cone's height and its base radius to unity (h = 1, r = 1). Twenty circles and n lines are drawn on the cone's surface to represent any protein sequence of length n. Each circle represents one of the 20 different amino acids, and each line represents a single residue of the protein sequence. The base circle is considered the first circle. The circumference of the base circle is divided into n equal divisions by n points. The coordinates of the first point, at θ equal to zero, are (1, 0, 0). Each line is drawn through one of these points and the vertex. Consequently, the circumference of each of the twenty circles is divided into n equal divisions, as shown in Figure 2.
The 20 different amino acids are ordered alphabetically by their 3-letter codes. The 20 circles are then assigned to the 20 amino acids from base to vertex: the base circle is assigned to amino acid A, the 2nd circle to amino acid R, and so on. The distance in the z-direction between the j-th circle and the base circle is $u_j$; this distance is the height of the circle. By Eq. 1, the radius of the circle at height $u_j$ is

$$r_j = \frac{h-u_j}{h}\, r. \tag{2}$$

The alphabetic order of the 20 amino acids, the circles' radii, and the height of each circle are listed in Table 1. By substituting in Eq. 1, our proposed approach is expressed as follows:

$$x_{ij} = \frac{h-u_j}{h}\, r\cos\theta_i, \qquad y_{ij} = \frac{h-u_j}{h}\, r\sin\theta_i, \qquad z_{ij} = u_j, \qquad \theta_i = \frac{2\pi(i-1)}{n}, \tag{3}$$

where r = 1, h = 1; j = 1, 2, 3, ..., 20 is the number of the circle corresponding to the amino acid; i = 1, 2, 3, ..., n is the residue's position in the protein sequence; and n is the protein sequence length. The spots of our 3-D graphical representation are calculated by using Eq. 3. These spots are placed at the intersections of the circles and lines. Our 3-D graphical representation is therefore obtained as a walk in 3-D space over the intersections of the 20 circles and n lines on the cone's surface.
We have applied our approach to two short protein segments of the yeast Saccharomyces cerevisiae. The protein I sequence is "WTFESRNDPAKDPVILWLNGGPGCSSLTGL" and the protein II sequence is "WFFESRNDPANDPIILWLN...". We have proposed the cone's top view in order to obtain a good visualization of our 3-D graphical representation. In this single view we can see all the points that represent the protein's residues.
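The mapping from a sequence to spots on the cone can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cone_spots` is hypothetical, the circle heights are assumed to be equally spaced (u_j = (j - 1)h/20), which the text does not confirm since the actual heights are those of Table 1, and residue i is assumed to lie on the line at angle 2π(i - 1)/n.

```python
import math

# Amino acids in alphabetical order of their 3-letter codes (Ala, Arg, ..., Val).
ORDER = "ARNDCQEGHILKMFPSTWYV"

def cone_spots(sequence, r=1.0, h=1.0):
    """Return one (x, y, z) spot per residue on the surface of the cone."""
    n = len(sequence)
    spots = []
    for i, residue in enumerate(sequence):      # i = 0 .. n-1
        j = ORDER.index(residue)                # j = 0 is the base circle (Ala)
        u = j * h / len(ORDER)                  # ASSUMPTION: equally spaced heights
        theta = 2.0 * math.pi * i / n           # angle of the line for this residue
        radius = (h - u) / h * r                # circle radius from the cone equation
        spots.append((radius * math.cos(theta), radius * math.sin(theta), u))
    return spots

protein_1 = "WTFESRNDPAKDPVILWLNGGPGCSSLTGL"
spots = cone_spots(protein_1)

# The top view is the projection of the spots onto the base plane.
top_view = [(x, y) for x, y, _ in spots]
```

Because the radius shrinks linearly with height, two different amino acids at the same position always project to different points in the top view, which is why a single view is enough to show every residue.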

SIMILARITY/DISSIMILARITY ANALYSIS
The spatial median (x, y, z) of all the spots that represent the protein residues has been calculated to characterize each protein numerically.This means that each protein is represented by one point of three coordinates.The Similarity/dissimilarity analysis becomes simpler than before.
The similarity/dissimilarity analysis can be measured by calculating correlation angle or Euclidean distance between the proteins' descriptor.We have calculated the Euclidean distance between the spatial median of each protein.The smaller Euclidean distance is the more similar two protein' sequences.The Euclidean distance between each ND5 proteins are given in Table 3 and between each β-globin proteins are given in Table 4.
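The descriptor and the distance between two descriptors can be sketched as follows. The text does not say how the spatial median is computed; this sketch assumes it is the geometric median (the point minimizing the sum of Euclidean distances to all spots) and approximates it with Weiszfeld's iteration. The names `spatial_median` and `descriptor_distance` are hypothetical.

```python
import math

def spatial_median(points, tol=1e-10, max_iter=1000):
    """Approximate the geometric median of 3-D points by Weiszfeld's iteration."""
    # Start from the centroid of the points.
    m = [sum(p[k] for p in points) / len(points) for k in range(3)]
    for _ in range(max_iter):
        num = [0.0, 0.0, 0.0]
        den = 0.0
        for p in points:
            d = math.dist(p, m)
            if d < 1e-12:
                # Simplification: skip a point that coincides with the estimate
                # to avoid dividing by zero.
                continue
            w = 1.0 / d
            den += w
            for k in range(3):
                num[k] += w * p[k]
        if den == 0.0:          # every point coincides with the estimate
            break
        new_m = [c / den for c in num]
        converged = math.dist(new_m, m) < tol
        m = new_m
        if converged:
            break
    return tuple(m)

def descriptor_distance(points_a, points_b):
    """Euclidean distance between the spatial medians of two sets of spots."""
    return math.dist(spatial_median(points_a), spatial_median(points_b))
```

Applying `descriptor_distance` to every pair of proteins would fill a symmetric matrix of the kind reported in Tables 3 and 4.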

RESULTS AND DISCUSSION
First, the approach is applied to the eight ND5 proteins. The 3D graphical representations are shown for three proteins in Figure 3. Figure 3(b) contains 602 spots corresponding to the 602 residues of the opossum protein. The zigzag line is omitted from the graphs of long proteins to avoid overlapping due to the huge number of spots. Our 3D graphical representation has no loss of information: the underlying protein primary sequence can be reconstructed, since each node in our 3D graph represents a single residue. The similarity/dissimilarity matrix of the eight ND5 proteins, based on the Euclidean distance between the spatial medians, is given in Table 3. Table 3 shows that the proteins of human, gorilla, common chimpanzee, and pygmy chimpanzee are similar to each other; fin whale and blue whale are also similar. On the other hand, the opossum protein is dissimilar to all the other seven proteins. The results listed in Table 3 agree with the results of sequence alignment and the known facts of evolution [18][19][20].
Next, the approach is applied to the six β-globin proteins. The similarity/dissimilarity matrix of the six β-globin proteins, based on the Euclidean distance between the spatial medians, is given in Table 4. Table 4 shows that the proteins of human, gorilla, and chimpanzee are similar to each other. On the other hand, the opossum and gallus proteins are dissimilar to them. These results agree with the results in [13].
Table 5. The correlation coefficients for the eight ND5 proteins of our approach and the approaches in Refs. [12] and [15], as compared with the percentage sequence identity (PID%) matrix.
Table 5 column headings: Our approach (spatial median) & (PID%); Literature [12] (eigenvalues) & (PID%).
We have compared the similarity/dissimilarity matrix of Table 3 with the percentage sequence identity (PID%), using the PID% matrix of [15]. The comparison is done through a linear correlation and significance analysis [15]. The correlation coefficients for the eight ND5 proteins of our approach, as compared with the PID% matrix, are listed in the 1st column of Table 5. We have also listed in Table 5 the previous comparisons of Refs. [12] and [15] with the PID% matrix; they are applied to the same ND5 proteins. Because we have a small data set (n = 8), which can produce high correlations, we considered the significance of the correlation to check whether the correlation between two sets of data is sufficiently strong or is likely to have occurred by chance. We checked for statistical significance for correlation coefficient values greater than 0.7. Our sample size is eight, so we use 6 degrees of freedom. A t-value of 2.447 or greater indicates a probability of less than 0.05 that the correlation occurred by coincidence. The t-values corresponding to the r-values are listed in Table 6, and all computed t-values are greater than 2.447. This indicates that the r-values in Table 5 are unlikely to have occurred by chance.
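The significance check described above can be sketched as follows, assuming the usual t-statistic for a Pearson correlation coefficient, t = |r| sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom. The function names are hypothetical and the inputs in the usage note are illustrative values, not data from Table 5.

```python
import math

def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_value(r, n):
    """t-statistic of a correlation coefficient r from n samples (|r| < 1)."""
    return abs(r) * math.sqrt((n - 2) / (1.0 - r * r))

# Two-tailed critical value for 6 degrees of freedom at the 0.05 level,
# as quoted in the text.
T_CRITICAL_DF6 = 2.447

def is_significant(r, n=8, t_critical=T_CRITICAL_DF6):
    """True if the correlation passes the t-test at the given critical value."""
    return t_value(r, n) >= t_critical
```

For example, with n = 8 a coefficient of 0.8 gives a t-value of about 3.27, which exceeds 2.447, so `is_significant(0.8)` returns `True`.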

CONCLUSION
A good visualization is obtained by representing the protein residues on the surface of a right cone of unit radius and unit height. The approach has no loss of information and is independent of a prior graphical representation of the RNA triplet codons. The graph occupies a limited space, and all the spots representing the protein's residues can be shown in one view (the cone's top view). The approach is applied to two short protein segments of equal length from the yeast Saccharomyces cerevisiae. It is also applied to eight ND5 proteins and six β-globin proteins, which are long proteins of unequal length, to demonstrate its utility. Our similarity/dissimilarity results are compared with PID% through a linear correlation and significance analysis.

Figure 1. A right circular cone.

Figure 2. The 3-D graphical representation of the thirty residues of protein I.
Figures 3(a) and (b) indicate the mismatching amino acids of the human protein and the opossum protein. In the figure, black spots are the amino acids common to both proteins, red spots are the amino acids found only in the human protein, and blue spots are the amino acids found only in the opossum protein.

Figure 3(a) contains 603 spots corresponding to the 603 residues of the human protein; all spots are shown in one view.

Figure 3. The cone's top view (a) for the human protein compared with opossum; (b) for the opossum protein compared with human. Black spots are the similar residues, red spots are the differing residues in human, and blue spots are the differing residues in opossum.

Table 1. Alphabetic order of the 20 amino acids according to their 3-letter code.

Protein I and protein II each consist of 30 residues. The thirty residues of protein I are represented on the cone as shown in Figure 2. The x-, y-, and z-coordinates of the two protein segments, calculated by using Eq. 3, are listed in Table 2. According to Table 2, the Euclidean distances are equal to zero except at positions 2, 11, 14, and 27. This means that the two proteins have four mismatching amino acids.
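The position-wise comparison can be illustrated with a short sketch. It uses only the first 19 residues of each segment, the portion of protein II that is quoted in the text, so it can reproduce only the mismatches that fall within that range. The helper `spot` is hypothetical, the circle heights are assumed to be equally spaced (the actual values are those of Table 1), and the angles are computed for n = 19 rather than for the full length of 30.

```python
import math

# Amino acids in alphabetical order of their 3-letter codes.
ORDER = "ARNDCQEGHILKMFPSTWYV"

def spot(residue, i, n, levels=20):
    """Spot of a residue at 0-based position i on the unit cone."""
    u = ORDER.index(residue) / levels      # ASSUMPTION: equally spaced heights
    theta = 2.0 * math.pi * i / n
    return ((1.0 - u) * math.cos(theta), (1.0 - u) * math.sin(theta), u)

segment_1 = "WTFESRNDPAKDPVILWLN"   # first 19 residues of protein I
segment_2 = "WFFESRNDPANDPIILWLN"   # quoted part of protein II
n = len(segment_1)

# Distance between the two spots that share the same line (same position).
distances = [
    math.dist(spot(a, i, n), spot(b, i, n))
    for i, (a, b) in enumerate(zip(segment_1, segment_2))
]

# A non-zero distance means the residues lie on different circles.
mismatches = [i + 1 for i, d in enumerate(distances) if d > 1e-12]
print(mismatches)   # [2, 11, 14]
```

Identical residues at the same position map to the same spot, so their distance is exactly zero regardless of the height spacing chosen for the circles.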

Table 2. The x-, y-, and z-coordinates of the two proteins and the Euclidean distance between the corresponding points.

Table 3. The similarity/dissimilarity matrix of the eight ND5 proteins based on the Euclidean distance between the spatial medians.

Table 4. The similarity/dissimilarity matrix of the six β-globin proteins based on the Euclidean distance between the spatial medians.

Table 6. The t-values computed for the correlation coefficients with |r| ≥ 0.7, on the basis of which the significance is determined.