Different age distribution patterns of human, nematode, and Arabidopsis duplicate genes
Introduction
Duplicate genes have been considered to be the primary source of genetic novelty since Ohno (1970), but how duplicate genes persist in a genome remains unclear. Lynch and Conery (2000) were among the first to use genomic data from various species to address this question. Assuming a constant rate of loss of duplicate genes, they used the distribution of Ks (the number of synonymous substitutions per synonymous site) between duplicate genes to estimate the half-life of duplicate genes. However, their analysis might have some weaknesses (Zhang et al., 2001). For example, their data might have included alternatively spliced genes, which can result in false duplicate gene pairs, and they counted the number of duplicate gene pairs instead of the number of duplication events. Later, Lynch and Conery (2003a) refined their analysis by using a phylogenetic approach to calculate Ka and Ks between duplicate pairs and by incorporating both the birth and loss rates of duplicate genes into their model. The result was consistent with that of their previous study.
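To make the constant-loss idea concrete, the following is a minimal sketch, not the procedure of Lynch and Conery. It assumes that the number of surviving duplicates decays exponentially with Ks, N(Ks) = N0 exp(-d Ks), fits d by least squares on log-transformed bin counts, and reports the half-life as ln(2)/d in units of Ks. The function names, the binning, and the fitting method are all illustrative assumptions; converting the half-life to years would require a synonymous substitution rate, which is not given here.

```python
import math


def fit_loss_rate(bin_centers, counts):
    """Fit ln(count) = ln(N0) - d * Ks by ordinary least squares.

    Returns (d, N0). Empty bins are skipped because ln(0) is undefined.
    This is an illustrative simplification, not a published method.
    """
    points = [(x, math.log(c)) for x, c in zip(bin_centers, counts) if c > 0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two non-empty bins")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


def half_life(loss_rate):
    """Half-life, in units of Ks, under exponential decay with the given rate."""
    return math.log(2) / loss_rate


if __name__ == "__main__":
    # Synthetic counts generated with a known loss rate, for illustration only.
    centers = [0.05 + 0.1 * i for i in range(10)]
    counts = [1000 * math.exp(-3.0 * x) for x in centers]
    d, n0 = fit_loss_rate(centers, counts)
    print(f"loss rate = {d:.3f} per unit Ks, half-life = {half_life(d):.3f} Ks")
```

A fit of this kind is only meaningful if the distribution really is monotonically decreasing; a distribution with a secondary peak would violate the assumption behind the model.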
In this study, we adopted a new approach and better-annotated data to reanalyze the age distribution of duplicate genes in a genome. First, we cleaned isoforms and repetitive elements before gene family grouping. Isoforms are defined as two different annotated proteins translated from the same gene and are often mistaken for a duplicate pair with a very small Ks. In addition, two proteins may be grouped together simply because they contain the same repetitive element fragment. By cleaning isoforms and repetitive elements, we excluded the major source of false hits in the search for duplicate genes. Second, we constructed a phylogeny for each gene family and calculated Ks for every duplication event within the family. The Ks value was then used as an index for the age of the duplication event. Third, we used only those events with a Ks value between 0.005 and 1 for our study because a gene pair with Ks<0.005 may be just two different alleles (annotation errors), and because accurate estimation of Ks becomes difficult when Ks>1.
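The first and third steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the record layout (protein ID, gene ID, sequence), the function names, the choice to keep the longest protein per gene, and the inclusive treatment of the window boundaries are all hypothetical. Repetitive-element masking and phylogeny construction are not shown.

```python
def longest_isoform_per_gene(proteins):
    """Keep one protein per gene so isoforms cannot pair up as false duplicates.

    `proteins` is an iterable of (protein_id, gene_id, sequence) tuples.
    Keeping the longest protein is an assumption made for this sketch.
    Returns a dict mapping gene_id to the retained protein_id.
    """
    best = {}
    for protein_id, gene_id, sequence in proteins:
        if gene_id not in best or len(sequence) > len(best[gene_id][1]):
            best[gene_id] = (protein_id, sequence)
    return {gene_id: protein_id for gene_id, (protein_id, _) in best.items()}


def events_in_window(ks_values, ks_min=0.005, ks_max=1.0):
    """Keep duplication events whose Ks lies in [ks_min, ks_max].

    Events below ks_min may be alleles or annotation errors; above ks_max,
    Ks estimates become unreliable. Inclusive bounds are an assumption.
    """
    return [ks for ks in ks_values if ks_min <= ks <= ks_max]


if __name__ == "__main__":
    # Toy records, invented for illustration.
    proteins = [("P1", "G1", "MKT"), ("P2", "G1", "MKTAA"), ("P3", "G2", "MK")]
    print(longest_isoform_per_gene(proteins))
    print(events_in_window([0.001, 0.005, 0.3, 1.0, 1.4]))
```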
Data download and processing
Human genes were downloaded from the Ensembl database release 19 (ftp://ftp.ensembl.org/pub/human-19.34b/data/fasta/). Caenorhabditis elegans genes were downloaded from WormBase release 123 (ftp://ftp.wormbase.org/pub/wormbase/archive/). Arabidopsis genes were downloaded from the TIGR database release 5.0 (ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/SEQUENCES/). Drosophila genes were downloaded from the BDGP Sequence and Annotation Databases, Drosophila release 3.0 (//www.fruitfly.org/sequence/release3download.shtml
Age distribution of duplicate genes in eukaryotic genomes
We analyzed four eukaryotic genomes: human, C. elegans, Arabidopsis, and Drosophila. The numbers of duplications with Ks between 0.005 and 1 are listed in Table 1. The fly genome contains only 108 duplications within this Ks range, whereas each of the other three species has at least eight times as many duplication events. Our stringent criteria for selecting duplicates might be one of the reasons for the small number of young duplications in fly. However, the same criteria were also applied to the
Discussion
Recently, Lynch and Conery (2003b) suggested that the age distribution of gene duplication events in a genome is L-shaped, indicating a steady birth and death process. However, our results do not support this hypothesis: the eukaryotic genomes differ greatly in the shape of the age distribution of duplications. In our study, only the human genome shows such an L-shaped distribution. The distribution in C. elegans shows only a weak peak of very young duplicate genes, while in Arabidopsis,
Acknowledgments
This study was supported by NIH grants. We thank Andre R.O. Cavalcanti for detecting block duplications in the human genome and Dr. Michael Lynch for helpful comments.
References (26)
Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends Genet. (2001)
Repeats in genomic DNA: mining and meaning. Curr. Opin. Struct. Biol. (1998)
Repbase Update: a database and an electronic journal of repetitive elements. Trends Genet. (2000)
et al. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell (2004)
et al. Extensive duplication and reshuffling in the Arabidopsis genome. Plant Cell (2000)
et al. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature (2003)
et al. Patterns of gene duplication in Saccharomyces cerevisiae and Caenorhabditis elegans. J. Mol. Evol. (2003)
PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics (1989)
et al. Genome organization in dicots: genome duplication in Arabidopsis and synteny between soybean and Arabidopsis. Proc. Natl. Acad. Sci. U. S. A. (2000)
et al. Age distribution of human gene families shows significant roles of both large- and small-scale duplications in vertebrate evolution. Nat. Genet. (2002)
Extent of gene duplication in the genomes of Drosophila, nematode, and yeast. Mol. Biol. Evol.
Comparing sequenced segments of the tomato and Arabidopsis genomes: large-scale duplication followed by selective gene loss creates a network of synteny. Proc. Natl. Acad. Sci. U. S. A.
The evolutionary fate and consequences of duplicate genes. Science