Statistical analysis of simple repeats in the human genome
Introduction
Experiments on the kinetics of DNA denaturation and renaturation, together with the analysis of DNA sequences, have revealed that most of our genome is populated by DNA repeats of different length, number and degree of dispersion [1]. Long repeats in few copies are usually orthologous genes, which may contain hidden repeats in the form of runs of amino acids, and retroviruses inserted in the genome. For example, the human genome contains more than 50 chemokine receptor genes with high sequence similarity [2] and almost one thousand olfactory receptor genes and pseudogenes [3]. Short repetitive DNA sequences may be categorized as highly repetitive or middle repetitive. The former consist of tandemly clustered DNA with variable-length motifs (5–100 bp) and are present in large islands of up to 100 Mb. The latter can be either short islands of tandemly repeated microsatellites/minisatellites ('CA repeats', tri- and tetra-nucleotide repeats) or mobile genetic elements. Mobile elements include DNA transposons, short and long interspersed elements (SINEs and LINEs), and processed pseudogenes [4], [5].
Why should we be interested in repetitive DNA? Tandem repeats with motifs of 1–3 bases can differ in repeat number among individuals; therefore, they are used as genetic markers for assessing genetic differences in plants and animals and in forensic testing. Trinucleotide repeats are known to be involved in several human neurodegenerative diseases (e.g., fragile X and Huntington's disease), and instability of short tandemly repeated DNA has been associated with cancer [6], [7], [8]. DNA repeats increase DNA recombination events and have the potential to destroy (by insertional mutagenesis), to create (by generating functional retropseudogenes), and to empower (by giving old genes new promoters or regulatory signals). Despite their importance in genome dynamics and for medical diagnosis, and despite the advances in the understanding of their role in several prokaryotic and eukaryotic genomes [9], [10], [11], [12], [13], a robust, genome-wide, statistical analysis of interspersed repetitive elements in the human genome is still lacking. In particular, the analysis of short repetitive DNA needs to fully exploit the relationship between simple repeats and mobile elements.
Interestingly, all mobile elements, such as SINEs, LINEs, and processed pseudogenes, contain A-rich regions of different length [14]. In particular, the Alu elements, present exclusively in primates, are the most abundant repeat elements in terms of copy number ( in the human genome) and account for more than 10% of the human genome [1]. They are typically 300 nucleotides in length, often form clusters, and are mainly present in non-coding regions. Higher Alu densities were observed in chromosomes with a greater number of genes and vice versa. Alus have a dimeric structure and are ancestrally derived from the 7SL RNA gene. They amplify in the genome by using an RNA polymerase III-derived transcript as template, in a process termed retroposition [15], [16]. Their mobility is facilitated by a variable-length A-rich region located at the 3' end [17], [18].
Although all Alu elements have poly(A) stretches, only very few are able to retropose [14]. Therefore, the mere presence of a poly(A) stretch is not sufficient to confer on an Alu element the ability to retropose efficiently. However, the length of the A stretch correlates positively with the mobility of the Alu [19].
The Alu repeats are divided into three subfamilies on the basis of their evolutionary age: Alu J (oldest), S (intermediate age) and Y (youngest) [20]. There is an inverse correlation between the age of the Alu subfamily and the proportion of the members with long A-tails in the genome, indicating that loss of A stretches may be a primary, though not the only, inactivating feature in the older subfamilies [19].
In this study, we first investigate exhaustively the distribution and characteristic length of all homopolymeric repeats (HR) of the kind poly(X) in the complete human genome, where X ∈ {A, C, G, T}. By means of simple tools drawn from linguistics, we show that stretches of homopolymeric repeats play a highly specialized role in the DNA text. In addition, we show that homopolymeric repeats are more specific words within the human genome than other repeats coded from different alphabets (see Table 1 for a list of alphabets considered here).
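As an illustration of what a length distribution of poly(X) repeats means in practice, the following minimal Python sketch counts maximal single-base runs in a sequence string. It is not the authors' pipeline: the function name `homopolymer_length_counts`, the `min_len` parameter and the toy sequence are assumptions made for this example only.

```python
import re
from collections import Counter

def homopolymer_length_counts(seq, bases="ACGT", min_len=1):
    """Count maximal runs of each base, keyed by base and then by run length.

    Hypothetical helper for illustration; characters outside `bases`
    (for example N) are skipped.
    """
    counts = {b: Counter() for b in bases}
    # (.)\1* matches one maximal run of a single repeated character
    for match in re.finditer(r"(.)\1*", seq.upper()):
        base = match.group(1)
        length = len(match.group(0))
        if base in counts and length >= min_len:
            counts[base][length] += 1
    return counts

# Toy sequence (an assumption, not genomic data)
toy = "AAAACGTTTAAGGGGGGCAAAA"
counts = homopolymer_length_counts(toy)
for base in "ACGT":
    print(base, dict(sorted(counts[base].items())))
```

For a real chromosome, the same counting could be applied to the sequence read from a FASTA file, and the resulting histogram per base compared across A, C, G and T.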
We quantify this effect by studying the characteristic positioning patterns of stretches of given composition and length. We then focus on long A stretches in human chromosomes 20, 21 and 22. These chromosomes differ substantially in both Alu density and gene density. Chromosomes 21 and 22, for example, are of similar size (together about 1.6% of the human genome), even though chromosome 22 has four times as many genes and twice as many Alu repeats [21]. The comparative analysis of the genome-wide distribution of poly(A) and other homo-dinucleotide polymers allows us to examine the mechanisms and constraints of poly(A) elongation or shortening, and how the elongation dynamics is related to the evolutionary instabilities [4], [22], DNA bendability [23], and nucleosome inhibition [24].
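One simple way to look at the positioning of long A stretches is to record where each stretch starts and how far apart consecutive stretches lie. The sketch below does this under stated assumptions: the threshold `min_len=10`, the function names and the toy sequence are all hypothetical choices for illustration and are not taken from the study.

```python
import re

def long_run_positions(seq, base="A", min_len=10):
    """Return (start, length) for every maximal run of `base` with length >= min_len.

    Hypothetical helper; positions are 0-based.
    """
    pattern = re.compile(f"{re.escape(base)}{{{min_len},}}")
    return [(m.start(), len(m.group(0))) for m in pattern.finditer(seq.upper())]

def gaps_between_runs(runs):
    """Distances from the end of each run to the start of the next one."""
    return [
        runs[i + 1][0] - (runs[i][0] + runs[i][1])
        for i in range(len(runs) - 1)
    ]

# Toy sequence (an assumption): two long A stretches and one short one
toy = "CG" + "A" * 12 + "CGTACGT" + "A" * 5 + "TT" + "A" * 10 + "G"
runs = long_run_positions(toy)
gaps = gaps_between_runs(runs)
print(runs)
print(gaps)
```

The distribution of such gaps along a chromosome is one possible summary of positioning; other statistics could serve the same purpose.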
Length distributions of poly(X) repeats in human genome
For our study of homopolymeric tracts in the human genome, we used all the finished sequences of the 24 chromosomes, among which were the published sequences of chromosomes 21 and 22, as well as a set of compiled sequences, together covering about 3 gigabases (Gb). DNA sequences were obtained from the GenBank directory of the web site of the National Center for Biotechnology Information (ftp://ncbi.nlm.nih.gov/).
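A common baseline for interpreting observed tract lengths is a random sequence with independent bases. This is an assumption introduced here for illustration, not a model stated in the text above. Under it, a maximal run of exactly n copies of a base with frequency p is expected about N (1 - p)^2 p^n times in a sequence of length N, ignoring edge effects. The function name `expected_run_count` is hypothetical.

```python
def expected_run_count(seq_len, p, run_len):
    """Expected number of maximal runs of exactly `run_len` copies of a base
    with frequency `p` in an i.i.d. sequence of length `seq_len`.

    Approximation: both flanking positions must differ from the base
    (factor (1 - p)**2) and edge effects are ignored.
    """
    return seq_len * (1 - p) ** 2 * p ** run_len

# Example with assumed values: 1 Mb of sequence and a base frequency of 0.3
for n in (5, 10, 20):
    print(n, expected_run_count(1_000_000, 0.3, n))
```

Under this baseline the expected count falls geometrically with run length, so observed counts of long tracts can be compared against a rapidly vanishing expectation.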
We find that homopolymeric tracts of the type poly(X) are substantially
Discussion
The sequence of the human genome is highly repetitious at different sequence length-scales, and coding sequences comprise less than 5% of it. Many human genome repeats can be found in mature mRNA and total cellular RNA. RNAs containing repetitive elements include Alu-containing mRNAs, which amount to 5% of all known mRNAs [1], [32].
Patterns of homo- and di-nucleotide expansion in the human genome suggest an explanation as to why, contrary to vertebrates, lower eukaryotes and bacteria avoid the
Conclusions
Despite the availability of several higher eukaryote genomes, the evolutionary dynamics of the simplest repeats are not yet fully understood. Since microsatellite slippage mutation rates depend on many factors, including repeat motif length, we have studied the genome-wide base composition of the microsatellites, focusing in particular on the relationships between poly(A) and Alus in the human genome.
We have shown by means of standard linguistic analysis that HRs are highly
Acknowledgements
F.P. acknowledges funding from the Italian Institute for Condensed Matter Physics, under the Forum project STADYBIS.
References (44)
- et al., Gene (2003)
- Curr. Op. Cell Biology (2002)
- et al., Physica A (1998)
- et al., Mol. Genet. Metab. (1999)
- et al., J. Biol. Chem. (1996)
- et al., J. Mol. Biol. (2003)
- Trends Genet. (2001)
- E.S. Lander et al., Nature 409 (2001)...
- et al., Genome Res. (2001)
- W.H. Li, Molecular Evolution, Sinauer Associates,...
- Nat. Rev. Cancer
- Cell. Mol. Life Sci.
- Phys. Rev. E
- Bioinformatics
- Phys. Rev. E
- Genome Res.
- Science
- Genome Res.
- Nat. Rev. Genet.
- Genome Res.