Frequency of dipeptides and antidipeptides

Although it is reasonable to expect that the frequency of a generic dipeptide XY in proteins is the same as that of its counterpart YX, an accurate statistical analysis of a large number of protein sequences shows that some dipeptides XY are considerably more frequent than their mirror images YX, referred to as antidipeptides. Given that this unexpected anisotropic frequency of occurrence was verified to be independent of the type of protein sequences analyzed, it is possible to conclude that this is a genuine phenomenon. Nevertheless, it was impossible to find the mechanism underlying it, which does not seem to be related to different conformational propensities, to the different conformational flexibility of the dipeptide/antidipeptide pair, to dissimilar accessibility to the solvent or to random gene mutations.


Introduction
Proteins are made of 20 types of α-amino acids, which have different shapes, dimensions, structures and physicochemical properties [1,2] and which are observed with different frequencies [3]. Different amino acid properties have been used to predict a variety of protein features, ranging from subcellular location [4] to protein-protein interfaces [5].
Despite its small size, this alphabet of 20 characters allowed Nature to create a large number of different proteins, amongst the astronomic number of possible sequences, which reaches the value of 20^N, where N is the sequence length. Interestingly, protein sequences cannot be back-traced, in the sense that if the sequence ABCDEFG is observed in Nature, the sequence GFEDCBA is not [6]. This asymmetry amongst the possible sequences can also be investigated at the level of short segments, for example dipeptides.
Here, however, the problem is a bit different, since Nature was able to use all the 400 possible dipeptides that can be written with an alphabet of 20 characters. This means that any of the 400 dipeptides can be found in proteins. In other words, one should not be looking for the existence of the dipeptide BA, given the existence of the dipeptide AB. As a consequence, the question can be reformulated as follows: is the dipeptide AB as frequent as the dipeptide BA?
Interestingly, we observed that, in some cases, one of the dipeptides (AB) is considerably more abundant than its symmetry-related antidipeptide BA. The natural abundance of the amino acids A and B cannot explain a preference of Nature for the dipeptide AB or for the dipeptide BA, since both dipeptides contain the same two residues. We therefore examined a wide series of possible features that might distinguish the dipeptide AB from its counterpart BA. On the one hand, we considered structural features, like secondary structure, accessibility to the solvent or conformational flexibility, and, on the other hand, we examined the possibility that random nucleotide mutations of the genes might cause the prevalence of one of the members of the dipeptide/antidipeptide pair. We did not find any feature that can explain why a certain dipeptide is preferred by Nature over its antidipeptide mirror image. We therefore propose that either this asymmetric frequency is merely accidental or that a not yet understood reason determines the occurrence of certain types of dipeptides.

Methods
The asymmetry between the frequency of a dipeptide AB and that of its antidipeptide BA was measured with the index C190, which is computed from nAB and nBA, the numbers of dipeptides AB and BA. The value of C190 is equal to zero if nAB = nBA or, in other words, when the frequency of observation of the dipeptide AB is equal to the frequency of observation of the dipeptide BA. On the contrary, if one of the two dipeptides, for example AB, is observed more frequently than the other (BA), the value of C190 is larger than 0 and it increases as the difference between nAB and nBA increases. It is possible to compute 190 values of C190 in a set of protein sequences, since A and B each indicate one of the 20 types of amino acid, each unordered pair of different residues is counted once (20 × 19 / 2 = 190) and the dipeptides of identical residues (for example AA, CC, DD etc.) are ignored.
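The defining equation of C190 is not reproduced in this text. As a minimal sketch, the Python code below uses one form that satisfies the stated properties (zero when nAB = nBA, growing with the difference) and that reproduces the PE/EP value quoted in the Results (7571 and 5384 observations, C190 = 33.76): C190 = 100 |nAB − nBA| / ((nAB + nBA) / 2). This formula is an assumption inferred from those numbers, not a confirmed definition, and the function names and toy sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def count_dipeptides(sequences):
    """Count every overlapping pair of adjacent residues in the sequences."""
    counts = Counter()
    for seq in sequences:
        for first, second in zip(seq, seq[1:]):
            counts[first + second] += 1
    return counts


def c190(n_ab, n_ba):
    """ASSUMED form: absolute difference as a percentage of the mean count."""
    total = n_ab + n_ba
    if total == 0:
        return 0.0  # neither dipeptide observed: no measurable asymmetry
    return 100.0 * abs(n_ab - n_ba) / (total / 2.0)


def all_c190(counts):
    """One value per unordered pair of different residues (190 in total)."""
    return {
        a + b + "/" + b + a: c190(counts[a + b], counts[b + a])
        for a, b in combinations(AMINO_ACIDS, 2)
    }


# Made-up sequences, used only to exercise the functions.
toy_counts = count_dipeptides(["MPEPEAG", "GPEKPE"])
values = all_c190(toy_counts)
print(len(values), round(c190(7571, 5384), 2))
```

Because the absolute difference is used, the value does not say which member of the pair is the more frequent one; a signed variant would simply drop the `abs`.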
Alternatively, the propensity of a certain type of amino acid to be followed by another type of amino acid was computed. For example, the propensity of alanine to precede glycine is obtained from four counts: nAG is the number of times an alanine precedes a glycine, nXG is the number of times a residue (of any type) precedes a glycine, nAX is the number of times an alanine precedes a residue (of any type), and nXX is the number of times a residue (of any type) precedes a residue (of any type). Note that nXX is the number of dipeptides observed in the set of protein sequences, nAX is the number of dipeptides where the first residue is an alanine, nXG is the number of dipeptides where the second residue is a glycine, and nAG is the number of alanine-glycine dipeptides.
More generally, the propensity of occurrence of a dipeptide BJ is obtained from the corresponding counts: nBJ is the number of dipeptides of type BJ, nXJ is the number of times a residue (of any type) precedes a residue J, nBX is the number of times a residue B precedes a residue (of any type), and nXX is the number of times a residue (of any type) precedes a residue (of any type).
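The propensity equation is not reproduced in this text. A minimal sketch, assuming the usual ratio form built from the four counts defined above, propensity(BJ) = (nBJ / nBX) / (nXJ / nXX), is given below. The formula is an assumption (although any ratio of these four counts that equals 1 for independent residues reduces to nBJ·nXX / (nBX·nXJ)), and the function names and toy sequences are hypothetical.

```python
from collections import Counter


def count_dipeptides(sequences):
    """Count every overlapping pair of adjacent residues in the sequences."""
    counts = Counter()
    for seq in sequences:
        for first, second in zip(seq, seq[1:]):
            counts[first + second] += 1
    return counts


def propensity(counts, first, second):
    """ASSUMED ratio form: (n_BJ / n_BX) / (n_XJ / n_XX)."""
    n_bj = counts.get(first + second, 0)
    # Dipeptides whose first residue is `first` (n_BX).
    n_bx = sum(n for pair, n in counts.items() if pair[0] == first)
    # Dipeptides whose second residue is `second` (n_XJ).
    n_xj = sum(n for pair, n in counts.items() if pair[1] == second)
    # All dipeptides (n_XX).
    n_xx = sum(counts.values())
    if n_bx == 0 or n_xj == 0:
        return float("nan")  # propensity undefined without observations
    return (n_bj / n_bx) / (n_xj / n_xx)


# Made-up sequences: AG occurs 3 times, GA twice, AA once.
toy_counts = count_dipeptides(["AGAG", "GAAG"])
p_ag = propensity(toy_counts, "A", "G")
p_ga = propensity(toy_counts, "G", "A")
print(p_ag, p_ga)
```

A value above 1 means that the second residue follows the first more often than expected from their separate frequencies at the two positions; comparing the propensity of BJ with that of JB gives a second view of the asymmetry measured by C190.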
Several sets of protein sequences were considered. In all cases, the data were downloaded from the UniProt database and the redundancy was reduced to 40% sequence identity with the program cd-hit. In each case, only the sequences of entire proteins were taken into account (protein fragments were ignored) and only proteins whose existence was proven experimentally were considered. The datasets are summarized in Table 1.
Molecular dynamics simulations were performed in vacuo with the program Dynamic of the Tinker software package (10,000 steps of 1 femtosecond at 298 K with the amber99 force field, recording a model every 0.1 picoseconds) [7]. Five initial conformations were selected for each dipeptide, the termini of which were not capped, and five simulations were performed for each dipeptide. The results were statistically indistinguishable.
Protein three-dimensional structures were extracted from the Protein Data Bank [8,9].

Results and Discussion
The C190 values are summarized graphically in Figure 1. Most of them are close to zero, as expected if proteins contain the same number of dipeptides AB and BA, though some of them are considerably larger than zero. They range from 0.04, for the dipeptides PR/RP, to 33.76, for the dipeptides EP/PE, and their average value is equal to 6.50 (standard error = 0.29).
The twenty average C190 values for the dipeptides that contain one of the twenty types of amino acids are shown in Table 2. It can be seen that if the dipeptides contain proline the C190 values tend to be, on average, higher than the others (average C190 = 11.86). This might be related, to a first approximation, to the conformational rigidity of this particular amino acid, the side chain of which is covalently bonded to its main-chain nitrogen atom. It is possible, in other words, that the rigidity of proline makes it difficult for some residues to precede or to follow it. However, it must be observed that the lowest C190 value is observed for the dipeptides PR/RP, which contain proline and, therefore, any interpretation based only on the fact that proline is conformationally anomalous is likely to be rather naïve. Moreover, in some cases, it is the dipeptide with proline in the first position (PX) that is observed more frequently than the dipeptide where proline occupies the second position (XP).
The second highest average C190 value is associated with the dipeptides that contain methionine. In this case, one must observe that the dipeptides MX are considerably more numerous (789,224) than the antidipeptides XM (717,205) and, as a consequence, the C190 value for the MX/XM pair is much larger than zero (10.45). However, this is certainly due to the highly frequent N-terminal methionines, which are sometimes (but not always) retained in the sequences deposited in the databases [13].
High average C190 values are also observed for dipeptide/antidipeptide pairs that contain a particular residue like cysteine (average C190 = 9.86), a residue with an anionic side chain like aspartate (average C190 = 9.76), a small apolar residue like alanine (average C190 = 9.03), or a large aromatic residue like tryptophan (average C190 = 8.43). On the contrary, the smallest average C190 values are observed for the dipeptide/antidipeptide pairs that contain an apolar amino acid like valine (3.47) or an aromatic residue like phenylalanine (4.17).
Some of the C190 values are certainly large (see Table 3). For example, it is much more common to observe the dipeptide PE (7571 observations) than its antidipeptide counterpart (5384 observations) and the C190 of the PE/EP pair is equal to 33.76. Seven dipeptide/antidipeptide pairs have a C190 larger than 20 (see Table 3). Five of them involve proline and the other two methionine. The other residues can be large (tryptophan) or small (glycine and alanine). Only one dipeptide/antidipeptide pair contains a polar amino acid (glutamic acid). Interestingly, the pair GP/PG, which contains the two residues (proline and glycine) that are conformationally most different from all the other 18 amino acids, also has one of the highest C190 values.
In order to verify if these trends are genuine or are a simple consequence of the insufficient sampling of the protein sequences, I adopted two strategies.
On the one hand, the C190 values were computed on different sets of proteins (see Table 1). I considered proteins expressed in a single organism (Homo sapiens and Escherichia coli), localized in a single sub-cellular location (cytoplasm, membrane, extracellular space), or adopting different types of quaternary structure (monomeric, homooligomeric, and heterooligomeric proteins). The C190 values computed with all these different datasets are shown in Table 4. Some fluctuations are observed amongst the different sets. For example, the C190 value for the dipeptide/antidipeptide pair CP/PC ranges from 12 (in the set of membrane proteins) to 30 (in the set of E. coli proteins). However, the C190 values of the dipeptide pairs also shown in Table 3 are always much larger than zero. This supports the hypothesis that the trends previously described are genuine and do not result from an insufficient amount of information. In other words, it is possible to be quite confident that the number of protein sequences used to compute the C190 values is sufficient to delineate a statistically significant tendency.
On the other hand, I used an approach named the Fragmented Prediction Performance Plot [14]. The C190 values were computed by using smaller datasets of increasing size. First, I used 39 non-overlapping subsets, taken from the Any dataset of Table 1 and each containing 1,000 proteins, and the averages of the C190 values were computed, together with their standard deviations. Then the same procedure was applied a second time to 13 non-overlapping subsets of 3,000 proteins, a third time to six subsets of 6,000 proteins and, finally, a fourth time to two non-overlapping subsets of 12,000 proteins. Some relevant results are summarized in Table 5, where it is possible to see that the values of C190 are rather independent of the number of protein sequences used to compute them, so that even the smaller subsets allow reasonable estimations of the C190 values. The same is true also for the other C190 values that are not shown in Table 5. As a consequence, it is possible to be quite confident that the number of protein sequences used to compute the C190 values is sufficient to delineate a statistically significant tendency.
Table 6 shows the seven dipeptide/antidipeptide pairs that have the largest difference in propensity. It can be immediately seen that these seven pairs are the same as the seven pairs of Table 3, with the exception of the pair EN/NE, which is replaced in Table 3 by the pair AM/MA. The propensity values therefore agree with the C190 values and it can be concluded that (i) there are some dipeptides that are observed much more (or less) frequently than their corresponding antidipeptides; (ii) proline is often part of these dipeptides/antidipeptides; (iii) the GP/PG pair, which contains both the residues with anomalous Ramachandran plots, is amongst the dipeptides that behave differently from their antidipeptide counterparts.
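The subset-based check described in this section (C190 averaged over non-overlapping subsets of increasing size) can be sketched as follows. This is only an illustration of the resampling idea, not the FPPP implementation of [14]; the C190 form used (100 |nAB − nBA| divided by the mean of the two counts) is an assumption, and the function names, the random toy sequences and the scaled-down subset sizes are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, stdev


def pair_c190(sequences, a, b):
    """C190 of the pair ab/ba in a set of sequences (ASSUMED formula)."""
    counts = Counter()
    for seq in sequences:
        for x, y in zip(seq, seq[1:]):
            counts[x + y] += 1
    n_ab, n_ba = counts[a + b], counts[b + a]
    total = n_ab + n_ba
    return 0.0 if total == 0 else 100.0 * abs(n_ab - n_ba) / (total / 2.0)


def subset_statistics(sequences, subset_size, n_subsets, a, b):
    """Mean and standard deviation of C190 over non-overlapping subsets."""
    if n_subsets < 1 or n_subsets * subset_size > len(sequences):
        raise ValueError("not enough sequences for the requested subsets")
    values = [
        pair_c190(sequences[i * subset_size:(i + 1) * subset_size], a, b)
        for i in range(n_subsets)
    ]
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


# Toy data: 390 random sequences stand in for the real dataset, and the
# subset sizes are those of the text divided by 100.
rng = random.Random(0)
toy = ["".join(rng.choices("ACDEFGHIKLMNPQRSTVWY", k=200)) for _ in range(390)]
for n_subsets, size in [(39, 10), (13, 30), (6, 60), (2, 120)]:
    avg, sd = subset_statistics(toy, size, n_subsets, "P", "E")
    print(size, n_subsets, round(avg, 2), round(sd, 2))
```

With random sequences the averages tend to shrink as the subsets grow, because there is no real asymmetry to converge to; averages that stay stable across subset sizes, as reported for the real data, are what suggests that the sampling is sufficient.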

The fact that a dipeptide is more (or less) frequent than its antidipeptide counterpart can depend on numerous factors. The most obvious is that the non-bonding interactions between residue A and residue B in the dipeptide AB are different from those in the dipeptide BA. It is possible that the conformational space accessible to AB is different from that accessible to BA. In other words, two dipeptides of opposite sequence might have an anisotropic conformational energy.
A first way to test this hypothesis is to compute C190 and propensity values for the dipeptides A(X)nB and B(X)nA. In these dipeptides, the residues A and B are separated by n other residues (of any type). To a first approximation, if n is sufficiently large, the residues A and B cannot interact with each other in these dipeptides.
However, it is advisable to avoid large values of n, which would reduce the number of dipeptides that can be analyzed (for example, in a protein containing n+2 amino acids, there is only one A(X)nB dipeptide). For these reasons, the value of n was fixed at 5. This value is sufficiently large to avoid inter-residue contacts (and interactions), even in alpha-helical segments, and small enough to allow the formation of large sets of data. C190 and propensity values for these B(X)nA/A(X)nB pairs are shown in Table 7. It is apparent that the C190 values, although still different from 0, are much smaller than the values of Table 3. Moreover, the propensity values tend to converge, in the sense that they are nearly identical for the B(X)nA and A(X)nB dipeptides. It can therefore be concluded that if there are five residues between the two amino acids A and B, the reciprocal influence between residue A and residue B is much smaller. The anisotropic frequency of the AB and BA dipeptides therefore seems strictly related to short-range and geometrically local inter-residue interactions.
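Counting the spaced pairs A(X)nB only requires looking n+1 positions ahead instead of one. A minimal sketch follows; the function name and the toy sequences are hypothetical, and `gap` plays the role of n (5 in the text).

```python
from collections import Counter


def count_gapped_pairs(sequences, gap=5):
    """Count pairs A(X)nB: residue A, then `gap` residues of any type, then B."""
    counts = Counter()
    step = gap + 1  # distance between the positions of A and B
    for seq in sequences:
        # A sequence of length gap + 2 yields exactly one pair.
        for i in range(len(seq) - step):
            counts[seq[i] + seq[i + step]] += 1
    return counts


# Made-up sequences, used only to exercise the function.
single = count_gapped_pairs(["AKLMNQB"], gap=5)   # length n + 2 = 7
adjacent = count_gapped_pairs(["PEP"], gap=0)     # ordinary dipeptides
print(single, adjacent)
```

The resulting counts can be passed to the same asymmetry and propensity calculations used for adjacent residues, since those only depend on the pair counts.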
The different occurrence of dipeptides and antidipeptides may result from physicochemical reasons or from genetic evolution.
In the first case, one might verify if the physico-chemical properties of the dipeptide AB are different from those of the dipeptide BA. Moreover, this must be done not only for the pairs AB/BA that show a relevant asymmetry of occurrence but also on the pairs XY/YX that show the same frequency of occurrence. In fact, in this way, it is possible to try to discover if the different occurrence of a dipeptide/antidipeptide pair is due to physico-chemical causes.
For this reason, a series of comparisons were made between the dipeptides shown in Tables 3 and 6 (EP/PE [...] their propensity values. These comparisons were performed on a non-redundant set of 1758 protein crystal structures (maximal pairwise sequence identity = 20%, crystallographic resolution not worse than 1.6 Å and R factor not worse than 0.25) created with the PISCES web server [10], from which structures with missing atoms or residues (a phenomenon much more common than usually thought [15]) were disregarded.
The secondary structures, assigned with the Stride computer program [11], the atomic displacement parameters, normalized to zero mean and unit variance in order to allow one to compare different crystal structures [16,17], and the solvent accessibilities, monitored with the Naccess software [12], were unable to distinguish the two types of dipeptide/antidipeptide pairs. Similarly, a series of molecular dynamics simulations did not show a different behaviour between the two types of dipeptide/antidipeptide pairs. It was also observed that none of the dipeptide/antidipeptide pairs examined here has a systematic tendency to be located at the borders of any type of secondary structural element.
Another possible explanation of the asymmetric frequency of certain dipeptide/antidipeptide pairs relies on gene sequences. It is possible that certain dipeptides are more frequent than others because of the different probability of their emergence as a consequence of nucleotide deletions/mutations. To test this hypothesis, the sequences of the human genes available at the RefSeq database were considered (ftp://ftp.ncbi.nih.gov/refseq/). For each of them, one hundred mutants were created by randomly deleting one base, one hundred by randomly deleting five bases, one hundred by randomly mutating ten bases, and one hundred by randomly changing fifty bases.
After translation of the sequences, performed with the program Transeq of the EMBOSS software suite [18], the C190 values were computed together with the propensities. These were identical in the wild type sequences and in all the four types of mutants. It seems therefore reasonable to suppose that random modifications at the gene level are not responsible for the fact that some dipeptides are more frequent in proteins than others.
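The generation of the four types of mutants can be sketched as follows. This is a hypothetical illustration, not the procedure actually used: it assumes that positions are chosen uniformly at random without repetition and that a mutated base is always replaced by a different one, and it does not reproduce the translation step, which was performed with Transeq. The function names and the random toy gene are made up.

```python
import random

BASES = "ACGT"


def delete_bases(seq, n, rng):
    """Remove n distinct, randomly chosen positions (requires n <= len(seq))."""
    drop = set(rng.sample(range(len(seq)), n))
    return "".join(base for i, base in enumerate(seq) if i not in drop)


def substitute_bases(seq, n, rng):
    """Replace n distinct, randomly chosen positions with a different base."""
    chars = list(seq)
    for i in rng.sample(range(len(seq)), n):
        # ASSUMPTION: a substitution never leaves the base unchanged.
        chars[i] = rng.choice([b for b in BASES if b != chars[i]])
    return "".join(chars)


def make_mutants(seq, rng, copies=100):
    """Four sets of mutants: 1 or 5 deletions, 10 or 50 substitutions."""
    return {
        "delete_1": [delete_bases(seq, 1, rng) for _ in range(copies)],
        "delete_5": [delete_bases(seq, 5, rng) for _ in range(copies)],
        "substitute_10": [substitute_bases(seq, 10, rng) for _ in range(copies)],
        "substitute_50": [substitute_bases(seq, 50, rng) for _ in range(copies)],
    }


rng = random.Random(1)
gene = "".join(rng.choices(BASES, k=300))  # toy sequence, not a real gene
mutants = make_mutants(gene, rng)
print({kind: len(seqs) for kind, seqs in mutants.items()})
```

Each mutant would then be translated and analyzed exactly like the wild-type sequence, so that the C190 and propensity values of the five sets can be compared.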

Conclusions
Some dipeptides are considerably more frequent than others in proteins. This was quantified by means of two figures of merit, the C190 and the propensity, which monitor different features. The first (C190) monitors to what extent a dipeptide AB is more common than its antidipeptide counterpart BA. The second (propensity) is instead a measure of probability and indicates the tendency of B to follow A in the dipeptide AB (or the tendency of A to follow B in the antidipeptide BA). Although they are based on different models, both the values of C190 and of propensity indicate that some dipeptides are much more common than their antidipeptides (see Tables 3 and 6).
This does not seem to be caused by insufficient sampling. An FPPP analysis [14] shows that the amount of data is sufficient to delineate reliable trends. Moreover, similar tendencies were observed on smaller and more homogeneous sets of protein sequences (monomeric, homooligomeric, heterooligomeric, human, bacterial, nuclear, cytoplasmic or extracellular).
Despite numerous attempts, it has been impossible to identify the reasons that make some dipeptides much more common than their mirror images. Local conformational flexibility, local structure and the degree of solvent exposure were all found to be unrelated to the dipeptide frequency. Random gene mutations were also found to be unrelated to the dipeptide rate of occurrence.
Although it is reasonable to suppose that the intrinsic structural and molecular properties of dipeptides are determined by both their intermolecular connectivity and their interactions with the surrounding environment (see for example the thorough studies on the structures of several dipeptides and on the influence of solvation [19,20]), the reasons why some dipeptides are considerably more frequent than their antidipeptide counterparts remain, for the moment, elusive. This phenomenon is however very surprising and deserves further analyses in the future.
In particular, one can anticipate that analyses on longer protein segments (for example tripeptides, tetrapeptides or longer peptides) might provide additional and interesting information. Unfortunately, the information presently available in the databases, especially about protein structures, is insufficient to perform reliable statistical surveys of these longer polypeptide fragments. It is also possible that additional and interesting information might be provided by more extensive molecular dynamics simulations of the dipeptide/antidipeptide pairs, both isolated and in the context of protein structures. Finally, a further open question is the understanding of why some residues prefer to precede or follow other residues, something that can be examined by considering the sign, positive or negative, of the C190 values, in analogy with what is done