Statistical analysis of simple repeats in the human genome

https://doi.org/10.1016/j.physa.2004.08.038

Abstract

The human genome contains repetitive DNA at different levels of sequence length, number and dispersion. Highly repetitive DNA is particularly rich in homo- and di-nucleotide repeats, while middle repetitive DNA is rich in families of interspersed, mobile elements hundreds of base pairs (bp) long, including the Alu families. A link between homo- and di-polymeric tracts and mobile elements has recently been highlighted. In particular, the mobility of Alu repeats, which form 10% of the human genome, has been correlated with the length of poly(A) tracts located at one end of the Alu. These tracts have a rigid, non-bendable structure and have an inhibitory effect on nucleosomes, which normally compact the DNA. We performed a statistical analysis of the genome-wide distribution of lengths and inter-tract separations of poly(X) and poly(XY) tracts in the human genome. Our study shows that in humans the length distributions of these sequences reflect the dynamics of their expansion and DNA replication. By means of general tools from linguistics, we show that these tracts play the role of highly significant content-bearing terms in the DNA text. Furthermore, we find that such tracts are positioned in a non-random fashion, with an apparent periodicity of 150 bases. This allows us to extend the link between repetitive, highly mobile elements such as Alus and low-complexity words in human DNA. More precisely, we show that Alus are sources of poly(X) tracts, which in turn subtly affect the combination and diversification of gene expression and the fixation of multigene families.

Introduction

Experiments on the kinetics of DNA denaturation and renaturation, together with the analysis of DNA sequences, have revealed that most of our genome is populated by DNA repeats of different length, number and degree of dispersion [1]. Long repeats present in few copies are usually orthologous genes, which may contain hidden repeats in the form of runs of amino acids, and retroviruses inserted in the genome. For example, the human genome contains more than 50 chemokine receptor genes with high sequence similarity [2] and almost one thousand olfactory receptor genes and pseudogenes [3]. Short repetitive DNA sequences may be categorized as highly repetitive or middle repetitive. The former consists of tandemly clustered DNA with motifs of variable length (5–100 bp) and is present in large islands of up to 100 Mb. The latter can be either short islands of tandemly repeated microsatellites/minisatellites ('CA repeats', tri- and tetra-nucleotide repeats) or mobile genetic elements. Mobile elements include DNA transposons, short and long interspersed elements (SINEs and LINEs), and processed pseudogenes [4], [5].

Why should we be interested in repetitive DNA? Tandem repeats with motifs of 1–3 bases can differ in repeat number among individuals; they are therefore used as genetic markers for assessing genetic differences in plants and animals and in forensic testing. Trinucleotide repeats are known to be involved in several human neurodegenerative diseases (e.g., fragile X and Huntington's disease), and instability of short tandemly repeated DNA has been associated with cancer [6], [7], [8]. DNA repeats increase DNA recombination events and have the potential to destroy (by insertional mutagenesis), to create (by generating functional retropseudogenes), and to empower (by giving old genes new promoters or regulatory signals). Despite their importance in genome dynamics and for medical diagnosis, and despite advances in the understanding of their role in several prokaryotic and eukaryotic genomes [9], [10], [11], [12], [13], a robust, genome-wide statistical analysis of interspersed repetitive elements in the human genome is still lacking. In particular, the analysis of short repetitive DNA needs to fully exploit the relationship between simple repeats and mobile elements.

Interestingly, mobile elements such as SINEs, LINEs, and processed pseudogenes all contain A-rich regions of different length [14]. In particular, the Alu elements, present exclusively in primates, are the most abundant repeat elements in terms of copy number (>10^6 in the human genome) and account for more than 10% of the human genome [1]. They are typically 300 nucleotides in length, often form clusters, and are mainly present in non-coding regions. Higher Alu densities were observed in chromosomes with a greater number of genes and vice versa. Alus have a dimeric structure and are ancestrally derived from the 7SL RNA gene. They amplify in the genome by using an RNA polymerase III-derived transcript as template, in a process termed retroposition [15], [16]. Their mobility is facilitated by a variable-length A-rich stretch located at the 3' end [17], [18].

Although all Alu elements have poly(A) stretches, only a very few are able to retropose [14]. Therefore, the mere presence of a poly(A) stretch is not sufficient to confer on an Alu element the ability to retropose efficiently. However, the length of the A stretch correlates positively with the mobility of the Alu [19].

The Alu repeats are divided into three subfamilies on the basis of their evolutionary age: Alu J (oldest), S (intermediate age) and Y (youngest) [20]. There is an inverse correlation between the age of the Alu subfamily and the proportion of the members with long A-tails in the genome, indicating that loss of A stretches may be a primary, though not the only, inactivating feature in the older subfamilies [19].

In this study, we first exhaustively investigate the distribution and characteristic lengths of all homopolymeric repeats (HR) of the kind poly(X) in the complete human genome, where X ∈ {A, C, G, T}. By means of simple tools drawn from linguistics, we show that stretches of homopolymeric repeats play a highly specialized role in the DNA text. In addition, we show that they are more specific words within the human genome than other repeats coded from different alphabets (see Table 1 for a list of alphabets considered here).
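As an illustration of the kind of measurement described above, the following minimal Python sketch tallies the lengths of maximal poly(X) runs in a sequence string. It is not the authors' pipeline: the function names `homopolymer_runs` and `length_histogram` and the `min_len` threshold are hypothetical choices made for this example, and ambiguous bases such as N are simply treated as run boundaries.

```python
import re
from collections import Counter

# One alternative per base, each greedy, so every match is a maximal run.
_RUN = re.compile(r"A+|C+|G+|T+")

def homopolymer_runs(seq, min_len=2):
    """Yield (start, base, length) for each maximal poly(X) run with length >= min_len.

    Characters outside A, C, G, T (for example N) never match and so act as boundaries.
    """
    for m in _RUN.finditer(seq.upper()):
        length = m.end() - m.start()
        if length >= min_len:
            yield m.start(), m.group()[0], length

def length_histogram(seq, min_len=2):
    """Return {base: Counter({run_length: count})} for the poly(X) runs in seq."""
    hist = {base: Counter() for base in "ACGT"}
    for _, base, length in homopolymer_runs(seq, min_len):
        hist[base][length] += 1
    return hist
```

Calling `length_histogram(chromosome_sequence)` on a chromosome held in memory as a string would give, for each base, the number of tracts observed at each length; normalising those counts gives an empirical length distribution that could be compared across bases.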

We quantify this effect by studying the characteristic positioning patterns of stretches of given composition and length. We then focus on long A stretches in human chromosomes 20, 21 and 22. These chromosomes differ substantially in both Alu density and gene density. Chromosomes 21 and 22, for example, are of similar size (together about 1.6% of the human genome), yet chromosome 22 has four times as many genes and twice as many Alu repeats [21]. The comparative analysis of the genome-wide distribution of poly(A) and other homo- and di-nucleotide polymers allows us to examine the mechanisms and constraints of poly(A) elongation or shortening, and how the elongation dynamics are related to evolutionary instabilities [4], [22], DNA bendability [23], and nucleosome inhibition [24].
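The positioning analysis mentioned above can be sketched along the following lines, under stated assumptions. Here the separation between two tracts is taken to be the distance between the start positions of consecutive tracts of the same base; the source does not say which convention the authors used. The function names, the `min_len` cut-off and the `bin_width` are likewise hypothetical.

```python
import re
from collections import Counter

def tract_starts(seq, base="A", min_len=10):
    """Return start positions of maximal runs of `base` that are at least min_len long."""
    pattern = re.compile("%s{%d,}" % (re.escape(base), min_len))
    return [m.start() for m in pattern.finditer(seq.upper())]

def separations(starts):
    """Start-to-start distances between consecutive tracts (an assumed convention)."""
    return [b - a for a, b in zip(starts, starts[1:])]

def separation_histogram(seq, base="A", min_len=10, bin_width=10):
    """Count separations in bins of width bin_width, keyed by each bin's lower edge."""
    counts = Counter()
    for d in separations(tract_starts(seq, base, min_len)):
        counts[(d // bin_width) * bin_width] += 1
    return counts
```

If tracts were positioned with a preferred spacing, the histogram returned by `separation_histogram` would be expected to show excess counts near that spacing and its multiples relative to a shuffled control sequence.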

Length distributions of poly(X) repeats in human genome

For our study of homopolymeric tracts in the human genome, we used all the finished sequences of the 24 chromosomes, including the published sequences of chromosomes 21 and 22, as well as a set of compiled sequences, together covering about 3 gigabases (Gb). DNA sequences were obtained from the GenBank directory of the web site of the National Center for Biotechnology Information (ftp://ncbi.nlm.nih.gov/).

We find that homopolymeric tracts of the type poly(X) are substantially

Discussion

The sequence of the human genome is highly repetitive at different sequence length-scales, and coding sequences comprise less than 5% of it. Many of the repeats in the human genome can be found in mature mRNA and total cellular RNA. RNAs containing repetitive elements include Alu-containing mRNAs, which amount to 5% of all known mRNAs [1], [32].

Patterns of homo- and di-nucleotide expansion in the human genome suggest an explanation as to why, unlike vertebrates, lower eukaryotes and bacteria avoid the

Conclusions

Despite the availability of several higher eukaryote genomes, the evolutionary dynamics of the simplest repeats are not yet fully understood. Since microsatellite slippage mutation rates depend on many factors, including repeat motif length, we have studied the genome-wide base composition of microsatellites, focusing in particular on the relationship between poly(A) tracts and Alus in the human genome.

We have shown by means of standard linguistic analysis that HRs are highly

Acknowledgements

F.P. acknowledges funding from the Italian Institute for Condensed Matter Physics, under the Forum project STADYBIS.

References (44)

  • P. Lio et al., Gene (2003)
  • A.M. Weiner, Curr. Op. Cell Biology (2002)
  • H. Herzel et al., Physica A (1998)
  • P.L. Deininger et al., Mol. Genet. Metab. (1999)
  • E.W. Englander et al., J. Biol. Chem. (1996)
  • M.L. Coté et al., J. Mol. Biol. (2003)
  • A. Wagner, Trends Genet. (2001)
  • E.S. Lander et al., Nature 409 (2001)...
  • G. Glusman et al., Genome Res. (2001)
  • W.H. Li, Molecular Evolution, Sinauer Associates,...
  • A. Umar et al., Nat. Rev. Cancer (2004)
  • P. Bois et al., Cell. Mol. Life Sci. (1999)
  • G. Gifford, R. Brown, Methods Mol. Med. (2004)...
  • C. Acquisti et al., Chaos Sol. Fract. 20 (2004)...
  • D. Holste et al., Phys. Rev. E (2003)
  • F. Lillo et al., Bioinformatics (2002)
  • R.N. Mantegna et al., Phys. Rev. E (1995)
  • D. Dieringer et al., Genome Res. (2003)
  • H.H. Kazazian, Science (2004)
  • P.L. Deininger et al., Genome Res. (2002)
  • M.A. Batzer et al., Nat. Rev. Genet. (2002)
  • A.M. Roy-Engel et al., Genome Res. (2002)