Non-random distribution of homo-repeats: links with biological functions and human diseases

The biological function of multiple repetitions of single amino acids, or homo-repeats, is largely unknown, but their occurrence in proteins has been associated with more than 20 hereditary diseases. Analysing 122 bacterial and eukaryotic genomes, we observed that the number of proteins containing homo-repeats is significantly larger than expected from theoretical estimates. Analysis of statistical significance indicates that the minimal size of homo-repeats varies with amino acid type and proteome. In an attempt to characterize proteins harbouring long homo-repeats, we found that those containing polar or small amino acids S, P, H, E, D, K, Q and N are enriched in structural disorder as well as protein- and RNA-interactions. We observed that E, S, Q, G, L, P, D, A and H homo-repeats are strongly linked with occurrence in human diseases. Moreover, S, E, P, A, Q, D and T homo-repeats are significantly enriched in neuronal proteins associated with autism and other disorders. We release a webserver for further exploration of homo-repeats occurrence in human pathology at http://bioinfo.protres.ru/hradis/.

the distribution of homo-repeats in eukaryotic and bacterial proteomes and quantified the difference between expected and real occurrences in 1.5 million sequences. As the presence of low-complexity regions can cause cellular toxicity by promoting promiscuous interactions 16 , we investigated the relationships between homo-repeat occurrence, number of protein interactions and diseases. We release a dataset at http://bioinfo.protres.ru/hradis/ for further exploration of homo-repeat occurrence in human diseases.

Results and Discussion
In this study, we focused on the occurrence of homo-repeats in eukaryotic and bacterial proteomes. Previous analyses indicated that homo-repeats of 5 amino acids occur non-randomly 14,17,18 . How large is the difference between the expected and real occurrences of homo-repeats in 122 proteomes? How many proteins are expected to contain a homo-repeat of a certain length? If we compute the expected number of proteins <N(M)> harbouring a homo-repeat of M residues in a database containing 1 million protein sequences with average length of 500 residues and uniform amino acid frequency of 1/20, we have:
$$\langle N(M=5)\rangle \approx (1/20)^{5} \times 500 \times 10^{6} \approx 156; \qquad \langle N(M=6)\rangle \approx (1/20)^{6} \times 500 \times 10^{6} \approx 8$$
In the case of the human proteome our estimates indicate <N(M = 5)> ≈ 7 and <N(M = 6)> ≈ 0.3. Can this example be expanded into a more general model to study the occurrence of homo-repeats? To this aim, we have derived a recursive equation (Materials and Methods) that estimates the probability of homo-repeats occurring in the central or terminal parts of a protein sequence (Fig. 1A and Materials and Methods). We used the equation to investigate the frequency of the longest homo-repeat M in a protein sequence of length L (Fig. 1B). Using 122 proteomes (Supplementary Table S1), we studied the length distribution of protein sequences (Fig. 2) and their amino acid frequencies (Supplementary Fig. S1) to measure the expected number of proteins N(M, L) carrying a specific motif [see Materials and Methods, Eq. 1].
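The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch of the naive calculation; the function name and its default arguments are illustrative and simply encode the assumptions stated in the text (1 million sequences, average length 500, uniform frequency 1/20).

```python
def naive_expected_proteins(m, n_proteins=1_000_000, avg_length=500, p=1 / 20):
    """Naive expected number of proteins carrying a homo-repeat of m residues.

    Assumes (as in the text) that every protein has the same length and that
    the residue of interest occurs independently with frequency p.
    """
    return (p ** m) * avg_length * n_proteins


for m in (5, 6):
    # Prints the estimate for repeats of 5 and 6 residues.
    print(m, naive_expected_proteins(m))
```

Changing `n_proteins`, `avg_length` or `p` shows how quickly the estimate drops as the repeat length grows or the residue becomes rarer.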
The expected frequencies of homo-repeats such as poly-Q, poly-L and poly-C differ substantially from those observed in real proteomes (Fig. 3; Supplementary Materials): the length of homo-repeats in natural proteomes is much larger than the estimate based on amino acid frequencies and protein length distribution (Fig. 2 and Supplementary Table S1). We report in Table 1 the lengths of homo-repeats whose occurrences in real proteomes have a 10-fold difference from theoretical estimates.
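To compare with theoretical estimates, the observed side of the comparison requires counting proteins by their longest homo-repeat. A minimal sketch of such a count is given below; the helper names and the toy sequences are hypothetical and are not taken from the original analysis pipeline.

```python
from itertools import groupby


def longest_homo_repeats(sequence):
    """Return {residue: length of its longest uninterrupted run} for one sequence."""
    longest = {}
    for residue, run in groupby(sequence):
        length = sum(1 for _ in run)
        if length > longest.get(residue, 0):
            longest[residue] = length
    return longest


def count_proteins_with_repeat(sequences, residue, min_length):
    """Number of sequences whose longest run of `residue` is at least `min_length`."""
    return sum(
        1
        for seq in sequences
        if longest_homo_repeats(seq).get(residue, 0) >= min_length
    )


# Toy sequences for illustration only.
toy_proteome = ["MKQQQQQQAL", "MAAAKLV", "MSSSSSPQQ"]
print(count_proteins_with_repeat(toy_proteome, "Q", 5))
```

Dividing such observed counts by the corresponding expected counts gives the fold difference used to select the lengths reported in Table 1.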
Although previous genome analyses indicated that the minimal homo-repeat length is between 5 and 7 residues 14,17-19 , our results indicate that the size varies with the amino acid type. For polar and soluble residues 20 such as H, D, N, K and P, the minimal size is 4, while W, M, Y, F, Q and T, which are often found in amyloid regions 21 , show lengths ≥ 5. Residues occurring in loops (E, S and G) have lengths ≥ 5, whereas those containing hydrophobic elements in their side chains (I, R and A) are associated with sizes ≥ 6, with the exception of V and L, which have lengths ≥ 7 and 8. In general, N, D, and K homo-repeats show shorter sizes than Q, E, and R, although the motif length slightly depends on the kingdom (Table 1). In the case of the human proteome, all the homo-repeats show lengths ≥ 5 (Table 2), with the exception of V, S, A, L, I and M (size: 6) and C (size: 4).
Given the length of the sequence (L) and the sizes of the central (M) and C-terminal (K) motifs, it is possible to compute the probability p that a homo-repeat occurs using the recursive formula presented in Eq. 2. (A) The longest homo-repeat is in the central part of the sequence. (B) The longest homo-repeat is at the C-terminal.
Scientific Reports | 6:26941 | DOI: 10.1038/srep26941

How many partners do proteins with long homo-repeats have?
Our results indicate that homo-repeats are more frequent than expected from theoretical estimates. To investigate which characteristics are shared by the genes harbouring homo-repeats, we analysed their protein networks using BioGRID (version 3.4.134) 22 . Using 3514 human proteins carrying homo-repeats whose occurrence is more than 10-fold larger than expected (Table 2), we found an increase in the number of physical partners for R, A, T, G, S, P, H, E, D, K, Q and N repeats (Fig. 4). Out of 320000 interactions reported for the human proteome, we found that 94000 physical associations involve homo-repeats. The largest number of binding partners was observed for D, K, Q, and N, while I, W and Y are not associated with any interaction (Fig. 4). Thus, homo-repeat lengths can be connected with the number of physical associations. While hydrophobic homo-repeats are depleted in partners, hydrophilic ones have a larger number of interactions, which is in agreement with previous literature reporting enrichment of binding partners in polar regions with high structural disorder content 15,16 .

What physico-chemical features define human proteins with many interactions?
To understand what physico-chemical features contribute to the interaction ability of homo-repeat proteins, we used the multicleverMachine approach 23,24 . Based on the consensus of different predictors, multicleverMachine identifies signals in protein groups 14 . By directly comparing proteins that contain hydrophobic (A, G, C, V, I, L, M, F, Y and W; total of 1261 proteins) and hydrophilic (P, S, N, E, K, R, H, Q and T; total of 2672 proteins) homo-repeats, we found that the latter are enriched in RNA-binding ability and structural disorder (the analysis is based on the homo-repeat sizes reported in Table 2 and is available at http://www.tartaglialab.com/cs_multi/confirm/1358/5f36e6e108/) 25,26 . As shown in Fig. 5, the enrichments are significant for both RNA-binding ability (p-value = $10^{-35}$; Kolmogorov-Smirnov test; area under the ROC curve, AUC = 0.68) and structural disorder (p-value = $10^{-38}$; Kolmogorov-Smirnov test; AUC = 0.72). In agreement with the analysis reported in Fig. 4, proteins with a lower number of interacting partners (i.e., containing homo-repeats with C, F, I, L, M, V, W and Y amino acids) show a decreased amount of structural disorder, while those with a high number of partners (i.e., containing homo-repeats with E, D, G, S, Q, N, K, and H amino acids as well as the intermediate cases R and T) have increased nucleic-acid binding propensity. Thus, our findings are in agreement with previous evidence showing that structural disorder correlates with the presence of small and polar amino acids 27,28 and is associated with RNA-binding ability 26,29 . Moreover, gene ontology analysis performed with the multicleverMachine approach indicates that not only proteins containing poly-R and poly-K (Fig. 6A), but also those with negatively charged homo-repeats are able to bind RNA (Fig. 6A,B), as highlighted by recent studies 30 .

Relation of homo-repeats to human diseases
In agreement with previous literature data 31-35 , we found that Q, G, L, P, T, D, A, H and V homo-repeats have strong propensities to be coupled with pathology (Fig. 7; Table 3; Material and Methods). Indeed, a number of reports indicate that sequences containing repeats such as poly-A are associated with diseases, including synpolydactyly type II (gene HOXD13), blepharophimosis (FOXL2), oculopharyngeal muscular dystrophy (PABPN1), infantile spasm syndrome (ARX), and holoprosencephaly (ZIC2) 11 . Similarly, poly-Q expansions have been associated with Huntington's disease, dentatorubral-pallidoluysian atrophy (DRPLA), and spinocerebellar ataxias (SCA) 36 .
Recently, Manuel Irimia and colleagues identified a number of neuron-specific micro-exons (i.e., 27 nt in length) that are switched on during neural differentiation to enhance specific protein-protein interactions. Most of the micro-exon containing proteins are enriched in structurally disordered regions 37 and about 30% of them are misregulated in the brains of individuals with autism spectrum disorder 37 .
We studied the occurrence of homo-repeats in proteins harbouring micro-exons (895 cases) 37 , comparing their frequencies with expected values calculated on 20 random extractions of the human proteome (Table 4). Increasing the motif length from 4 to 9 amino acids, we found that the following homo-repeats are significantly enriched: 4 - S, E, P, A, Q and T; 5 - S, E, P, A, Q, D and T; 6 - S, E, P, Q, D and T; 7 - S, E, P, Q, T and H; 8 - S, E, P, A, Q, T and H; 9 - S, E, P, Q and T (Table 4).

The HRaDis database
8145 out of 59053 H. sapiens proteins (reviewed and un-reviewed entries in the UniProt database) contain homo-repeats longer than 4 amino acids, which represents a non-negligible component of the proteome (14%). By considering all the homo-repeats currently linked to disease (578 out of 2501 entries; Table 3), the fraction rises to 23%, indicating that homo-repeats are tightly linked with pathology. For instance, out of all the proteins related to neurodegenerative diseases (90 entries), 13 harbour homo-repeats: . This list expands publicly available repositories, such as "PolyQ" 38 , in which only four proteins (ATX1, ATX2, ATX7 and IID) were associated with disease.
To better investigate the link between homo-repeat occurrence and disease, we release the HRaDis database (HomoRepeats and human Diseases, available at http://bioinfo.protres.ru/hradis/), in which human sequences are reported along with OMIM classifications and GO annotations.
Table 1. Lengths of homo-repeats whose frequencies in real proteomes have a 10-fold difference from theoretical estimates. N* is the number of proteins ($\times 10^{4}$), N** is the number of proteomes.
Table 2. Lengths of homo-repeats whose occurrence differs at least 10-fold between natural and expected human proteomes.

Figure 4. Homo-repeats and protein interactions.
Using a total of 94000 physical associations available from BioGRID 22 , we found that human proteins containing poly-E, poly-D, poly-K, poly-Q, and poly-N have more interactions than the rest of the proteome (homo-repeat size is chosen according to Table 2; mean and standard error of the mean are shown). The red line indicates the average number of partners (16 interactions) in H. sapiens (total of 320000 interactions).

Conclusions
In this work, we showed that the number of homo-repeats in eukaryotic and bacterial proteomes is significantly larger than expected from theoretical estimates. Our calculations indicate that the minimal length that is statistically significant varies with amino acid type and proteome. In H. sapiens, occurrence of homo-repeats is associated with high content of structurally disordered regions and enhanced RNA-binding potential, which is in line with recent experimental findings 26,29 . We also observed that proteins containing homo-repeats have a large number of interactions, which can promote perturbation of protein networks and cause dysfunction 39 .
Although the functional roles of homo-repeats are unknown, we found that their occurrence is associated with pathology. Certain homo-repeats, such as the poly-A tract in Homeobox 2B protein (PHOX2B), are highly conserved in vertebrate species and might have a biological function. Yet, it has been reported that poly-A is frequently linked with diseases such as synpolydactyly type II (HOXD13), blepharophimosis (FOXL2), oculopharyngeal muscular dystrophy (PABPN1) and infantile spasm syndrome (ARX) 11 . Similarly, poly-Q expansions are associated with neurodegeneration 36 and their length is proportional to disease severity 40 . The link between homo-repeats and disease is particularly relevant if we consider that a recent study reports the involvement of low-complexity regions in proteins involved in autism 37 .
Possible models for the evolution of homo-repeats have been proposed 41-44 . Yet, they are still debated, and further biological information is necessary to assess possible functions. One interesting mechanism that links homo-repeats with protein dysfunction is that amino acid expansions can be caused by slippage errors in DNA replication, recombination and repair 45-49 . We hope that our work will be useful for the characterization of homo-repeats in the human proteome and that, starting from direct analysis of the sequences available at http://bioinfo.protres.ru/hradis/, it will be possible to build a catalogue to decipher the biological functions as well as the evolutionary patterns of these sequences.

Material and Methods
Probability of occurrence of the longest homo-repeat at different protein lengths. For a polypeptide of length L containing two amino acid types, A and X (any amino acid different from A), the probability of finding A at any position of the chain is equal to p (the probability of finding X is equal to 1 − p). Assuming that M is the length of the longest homo-repeat of amino acid A (if A is absent, then M = 0) and K is the length of the homo-repeat adjacent to the C-end of the chain (if the chain terminates with X, then K = 0), we can determine the probability of a homo-repeat in an iterative way. Indeed, if A is added at the C-end, K increases by 1 (if K = M, then M is also incremented by 1). The probability of this event is P(p, M, K + 1, L) = P(p, M, K, L − 1) · p or P(p, M + 1, M + 1, L) = P(p, M, M, L − 1) · p (Fig. 1). By contrast, if X is added, then M does not change and K becomes 0, and the probability of this event is P(p, M, 0, L) = P(p, M, K, L − 1) · (1 − p) (Fig. 1). Thus, knowing the joint distribution of M and K for the chain length L − 1, it is possible to calculate the distribution of M and K for the chain length L. For a chain with one residue: P(p, M = 0, K = 0, 1) = 1 − p and P(p, M = 1, K = 1, 1) = p (the probability of other M and K values for a chain with one residue is equal to zero). By adding up the values for K (0 ≤ K ≤ M), we calculated the probability depending on the length of the largest homo-repeat M and the chain length L (see Results section).
Using the OMIM database available at http://www.omim.org/, we found that poly-G, poly-A, and poly-P are strongly associated with disease (standardized Z-score > 5; Material and Methods), followed by poly-E, poly-S, poly-Q, poly-L, poly-D and poly-H. Green colour corresponds to homo-repeats with Z-score > 5, yellow to 3 < Z-score < 5, and white to Z-score < 3 (homo-repeat size is chosen according to Table 2).
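The iterative scheme described above (append A with probability p, append X with probability 1 − p, and track the pair M, K) can be sketched as a small dynamic programme. This is an illustrative implementation under the stated two-letter model, not the authors' code; the function name is hypothetical, and a one-residue chain is taken to be A with probability p and X with probability 1 − p.

```python
def longest_run_distribution(p, length):
    """Distribution of the longest run of residue A in a random chain.

    Returns a list `dist` of size length + 1 where dist[m] is the probability
    that the longest homo-repeat of A has exactly m residues. `length` must be
    at least 1.
    """
    # state[(m, k)]: probability that the longest run is m and the run
    # adjacent to the C-terminus has length k (k = 0 if the chain ends with X).
    state = {(0, 0): 1.0 - p, (1, 1): p}
    for _ in range(length - 1):
        new_state = {}
        for (m, k), prob in state.items():
            # Append A: the C-terminal run grows; the longest run grows if k == m.
            k_a = k + 1
            m_a = max(m, k_a)
            new_state[(m_a, k_a)] = new_state.get((m_a, k_a), 0.0) + prob * p
            # Append X: the longest run is unchanged and the C-terminal run resets.
            new_state[(m, 0)] = new_state.get((m, 0), 0.0) + prob * (1.0 - p)
        state = new_state
    # Sum over K to obtain the distribution of M alone.
    dist = [0.0] * (length + 1)
    for (m, _k), prob in state.items():
        dist[m] += prob
    return dist


dist = longest_run_distribution(p=0.05, length=100)
print(sum(dist[5:]))  # probability that the longest run has 5 or more residues
```

The number of (M, K) states grows with the square of the chain length, so for long chains it may be preferable to cap M at the largest repeat length of interest.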
If we take the distribution of protein lengths and the amino acid frequencies from the set of 122 proteomes (see Supplementary Table S1), we can measure the expected number of proteins carrying a specific motif size M:
$$\langle N(M)\rangle = \sum_{L} N_{L} \sum_{K=0}^{M} P(p, M, K, L)$$
where N_L is the number of proteins with length L in the database.
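As an illustration of how a length distribution enters the estimate, the sketch below weights the probability of a given longest repeat by the number of proteins of each length. For compactness it does not use the (M, K) recursion of the text but a different, standard technique: an absorbing Markov chain over the trailing run length, which gives the probability of a run of at least M residues, with the probability of the longest run being exactly M obtained by difference. The function names and the toy length histogram are hypothetical.

```python
def prob_run_at_least(p, m, length):
    """Probability that a random chain of `length` residues contains a run of
    at least m copies of a residue occurring with frequency p."""
    if m <= 0:
        return 1.0
    if m > length:
        return 0.0
    # alive[k]: probability that no run of m has occurred yet and the
    # trailing run of the residue has length k (0 <= k < m).
    alive = [0.0] * m
    alive[0] = 1.0
    for _ in range(length):
        nxt = [0.0] * m
        nxt[0] = sum(alive) * (1.0 - p)  # a different residue resets the run
        for k in range(m - 1):
            nxt[k + 1] = alive[k] * p    # the run is extended by one residue
        # The mass alive[m - 1] * p reaches a run of m and is absorbed.
        alive = nxt
    return 1.0 - sum(alive)


def expected_proteins(p, m, length_counts):
    """Expected number of proteins whose longest run has exactly m residues.

    `length_counts` maps a protein length L to the number of proteins N_L.
    """
    return sum(
        n_l * (prob_run_at_least(p, m, L) - prob_run_at_least(p, m + 1, L))
        for L, n_l in length_counts.items()
    )


# Toy length histogram for illustration only (not real proteome data).
toy_lengths = {200: 3000, 500: 5000, 1000: 2000}
print(expected_proteins(p=0.05, m=5, length_counts=toy_lengths))
```

With real data, `length_counts` would be the observed histogram of protein lengths and `p` the frequency of the amino acid of interest in the proteome.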
Calculation of the probability of homo-repeat occurrence. If the probability of finding two homo-repeats of length M is small, our Eq. 1 can be approximated (M ≪ L and M ≠ 0). If the homo-repeat lies at the C-terminus of the protein, there will be M amino acids of type A and another amino acid X, with probability $p^{M}(1-p)$. If the homo-repeat lies in the middle of the protein, there will be M amino acids of type A and two other amino acids at the edges, with probability $p^{M}(1-p)^{2}$. Taking into account that the homo-repeat can be placed in two positions at the edges of the protein and in (L − M − 1) positions in the middle, the overall homo-repeat probability is:
$$P(p, M, L) \approx 2\,p^{M}(1-p) + (L - M - 1)\,p^{M}(1-p)^{2}$$
As natural proteins are typically shorter than 1000 residues, the approximation works at p ≤ 0.05 and M ≥ 4 ($Lp^{M} < 0.01$). We note that some amino acids, such as leucine, occur with frequency p ≈ 0.1. In such cases, the approach works well if M ≥ 5.
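The approximation described above (two edge placements plus L − M − 1 internal placements) is simple enough to evaluate directly. The following sketch uses hypothetical function names and also reports the quantity L·p^M that the text uses to judge whether the approximation is applicable.

```python
def approx_repeat_probability(p, m, length):
    """Approximate probability of a homo-repeat of exactly m residues.

    Two edge placements contribute p**m * (1 - p) each; the (length - m - 1)
    internal placements contribute p**m * (1 - p)**2 each.
    """
    edge = 2 * p ** m * (1 - p)
    middle = (length - m - 1) * p ** m * (1 - p) ** 2
    return edge + middle


def small_parameter(p, m, length):
    """The quantity L * p**M; the approximation is expected to hold when it is small."""
    return length * p ** m


print(approx_repeat_probability(p=0.05, m=4, length=500))
print(small_parameter(p=0.05, m=4, length=1000))
```

For a frequent residue and a short repeat, `small_parameter` becomes large and the exact recursion should be preferred over the approximation.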
Statistical analysis of homo-repeats and link with disease. If homo-repeat and disease frequencies are independent, the distribution has an average number of proteins. In Eq. 3, 4 and 5, N is the number of proteins in the human proteome (59053), N_a is the number of proteins associated with disease (2501, see Table 3), and N_b is the number of proteins with homo-repeats of length larger than or equal to 5. N_ab is the number of proteins carrying both characters in our database.
Table 4. Homo-repeat enrichments in neuronal proteins harbouring micro-exons. C indicates the number of cases associated with an amino acid motif of length between 4 and 9 (895 cases) and R indicates the average motif counts measured on 20 random extractions from the human proteome (each sample contains 895 cases) 37 . The standard deviation associated with 20 extractions is reported. Homo-repeats with standardized Z-score > 5 are given in bold.
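The independence test outlined in the statistical analysis above can be sketched as a standardized Z-score for the overlap N_ab. The expected overlap under independence, N_a·N_b/N, follows directly from the definitions; the variance used here is that of a hypergeometric null model, which is an assumption of this sketch and may differ from the expression used in the original Eq. 3-5. The function name is hypothetical, and the counts in the usage example are those quoted in the text for the whole proteome.

```python
import math


def overlap_z_score(n_total, n_a, n_b, n_ab):
    """Z-score of the observed overlap between two protein sets.

    Returns (expected overlap, standard deviation, Z-score), assuming the
    overlap follows a hypergeometric distribution under independence
    (an assumption of this sketch).
    """
    expected = n_a * n_b / n_total
    variance = (
        expected * (1 - n_a / n_total) * (n_total - n_b) / (n_total - 1)
    )
    sd = math.sqrt(variance)
    return expected, sd, (n_ab - expected) / sd


# Counts quoted in the text: 59053 proteins, 2501 disease-associated,
# 8145 with homo-repeats longer than 4 residues, 578 carrying both features.
expected, sd, z = overlap_z_score(59053, 2501, 8145, 578)
print(expected, sd, z)
```

The same function can be applied per amino acid type by replacing `n_b` and `n_ab` with the counts for that specific homo-repeat.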
cleverMachine. The cleverMachine (CM) algorithm analyzes physico-chemical properties of two protein datasets 50 . The tool creates profiles, or physico-chemical signatures, for each protein, utilizing a large set of features, both experimentally and statistically derived from other tools. In our analysis we used a number of physico-chemical properties (hydrophobicity, alpha-helix, beta-sheet, disorder, burial, aggregation, membrane and nucleic acid-binding propensities) and 10 propensity predictors per feature. Only differentially enriched properties (p-values < $10^{-5}$ using Fisher's exact test) were used in the calculations. Further information can be found at http://s.tartaglialab.com/page/clever_suite.
multicleverMachine. The multicleverMachine extends the concept of binary comparisons (CM) between protein datasets by introducing signal and negative sets 23,24 . After submission of one or more sets as signal and one or more sets as a negative group, the multicleverMachine creates a CM run for each possible combination of elements from the signal and negative sets. The result is presented in an easy-to-read format, allowing at-a-glance interpretation of the CM submission. The multicleverMachine provides a visualisation of enrichment strengths per group, making it easy to see for which groups properties such as disorder or alpha-helical propensity are enriched. More details about the method are available at http://www.tartaglialab.com/cs_multi/submission. In addition to the visualisation of individual enrichments, multiCM links each of the datasets to gene ontology analysis (http://www.tartaglialab.com/GO_analyser/universal and related documentation). To calculate GO enrichments, multicleverMachine uses built-in datasets containing all entries available for the proteome of interest (reference set) 23,24 .