Identification of novel conserved peptide uORF homology groups in Arabidopsis and rice reveals ancient eukaryotic origin of select groups and preferential association with transcription factor-encoding genes

Background Upstream open reading frames (uORFs) can mediate translational control over the largest, or major ORF (mORF) in response to starvation, polyamine concentrations, and sucrose concentrations. One plant uORF with conserved peptide sequences has been shown to exert this control in an amino acid sequence-dependent manner but generally it is not clear what kinds of genes are regulated, or how extensively this mechanism is invoked in a given genome. Results By comparing full-length cDNA sequences from Arabidopsis and rice we identified 26 distinct homology groups of conserved peptide uORFs, only three of which have been reported previously. Pairwise Ka/Ks analysis showed that purifying selection had acted on nearly all conserved peptide uORFs and their associated mORFs. Functions of predicted mORF proteins could be inferred for 16 homology groups and many of these proteins appear to have a regulatory function, including 6 transcription factors, 5 signal transduction factors, 3 developmental signal molecules, a homolog of translation initiation factor eIF5, and a RING finger protein. Transcription factors are clearly overrepresented in this data set when compared to the frequency calculated for the entire genome (p = 1.2 × 10⁻⁷). Duplicate gene pairs arising from a whole genome duplication (ohnologs) with a conserved uORF are much more likely to have been retained in Arabidopsis (Arabidopsis thaliana) than are ohnologs of other genes (39% vs 14% of ancestral genes, p = 5 × 10⁻³). Two uORF groups were found in animals, indicating an ancient origin of these putative regulatory elements.
Conclusion Conservation of uORF amino acid sequence, association with homologous mORFs over long evolutionary time periods, preferential retention after whole genome duplications, and preferential association with mORFs coding for transcription factors suggest that the conserved peptide uORFs identified in this study are strong candidates for translational controllers of regulatory genes.


Background
Upstream open reading frames (uORFs) are small open reading frames found in the 5' UTR of a mature mRNA, and can mediate translational regulation of the largest, or major, ORF (mORF). Regulation by uORFs has been studied in several individual transcripts, demonstrating the importance of uORFs in such processes as polyamine production [1], amino acid production [2,3], and sucrose response [4], but the biological effect of uORFs in the vast majority of transcripts of the genome is still unclear. Upstream start codons (uAUGs) occur in 20-30% of yeast, mammalian, and plant transcript 5' UTRs [5][6][7]; therefore, potentially thousands of genes are regulated in this manner.
The majority of characterized uORFs appear to act in an amino acid sequence-independent manner, regulating mORF translation by the uORF start codon nucleotide context, by the uORF length, or by the distance between the uORF stop codon and the mORF start codon, rather than by uORF-encoded peptides [8][9][10][11]. Some uORFs, however, do rely on peptide sequences to mediate translational regulation of the associated mORF, but few examples have been identified and characterized to date. In fungi and animals, a few genes have been shown to contain uORFs whose amino acid sequences are similar between two or more species [12][13][14][15][16][17], but only two cases, CPA1 [3] and SAMDC1/AdoMetDC1 [18], have demonstrated uORF sequence-dependent regulation. In plants, two groups of genes, S-Adenosylmethionine decarboxylases (AdoMetDCs; EC 4.1.1.50) and group S basic region leucine zipper (bZIP) transcription factors, have been shown to contain uORFs with similar amino acid sequences between monocots and dicots [19,20]. In the former group, mORF translational regulation is dependent on the sequence of the uORF peptide [1,4], and overexpression of the mORF in either group results in stunted or lethal phenotypes, suggesting that these genes play a critical role in growth and/or development. Indeed, AdoMetDC is required for the synthesis of polyamines, molecules that are implicated in essential plant functions such as cell division, embryogenesis, leaf, root, and flower development, and stress responses [21,22].
In general, it has been difficult to carry out genome-wide surveys of conserved peptide uORFs due to poor annotation of 5' UTRs. The availability of expressed sequence tags (ESTs) has improved exon and intron annotation of the genomic sequence, but they are relatively short and often do not predict the entire mRNA molecule, even when several ESTs overlap the same genomic region and can be assembled to predict one transcript. As there are very few introns in yeast transcripts, prediction of uORF conservation has been attempted in S. cerevisiae by analyzing genomic sequence upstream of predicted mORF start sites [23], but it is still not clear whether these uORFs are truly conserved (i.e., are under negative selection pressures), or are simply undergoing evolutionary drift. With the sequencing of the Aspergillus nidulans genome, comparison to A. fumigatus and A. oryzae has identified 38 uORFs with putatively conserved start and stop codon positions relative to the mORF, 14 of which are conserved in one of Neurospora crassa, Fusarium graminearum, or Magnaporthe grisea [5], but the authors did not comment on whether the uORF amino acid sequences are also conserved.
With the emergence of large plant full-length cDNA sequence collections [24][25][26], it is now possible to adopt a comparative genomics approach to determine the prevalence of conserved amino acid uORFs in the genome and the persistence of these elements throughout eukaryotic evolution. Because rice and Arabidopsis shared a common ancestor 140-200 million years ago (Mya) [27][28][29], sequence similarity retained over this amount of time provides good candidates for truly conserved peptide uORF sequences. In this study we have used Oryza sativa (rice) and Arabidopsis thaliana (Arabidopsis) full-length cDNA sequence collections to estimate the incidence of conserved peptide uORFs in the rice and Arabidopsis genomes, to determine the prevalence of uORFs within regulatory genes, and to compare evolutionary rates for uORFs versus mORFs. By examining more distantly related sequences, we posit an ancient origin for select uORFs and we provide evidence for one mechanism by which uORFs can arise within genes.

Identification of conserved peptide uORFs by comparison of rice and Arabidopsis transcripts
To identify conserved peptide uORFs, we developed "uORF-Finder", a Perl program that first compares the mORF amino acid sequence of each cDNA from one collection with the mORF sequences of another species' collection to identify putative mORF homologs, and then compares the uORFs in the 5' UTRs of the two paired sequences to identify uORFs with conserved amino acid sequences (see Methods). Comparison by uORF-Finder of a corrected set of 34000 full-length cDNA sequences from Arabidopsis with a similar set from rice resulted in the identification of conserved peptide uORFs in 44 Arabidopsis genes and 36 rice genes, which together comprise 19 homology groups based on uORF amino acid similarity (Tables 1, 2, 3; Figures 1, 2, 3, 4, 5). All three of the homology groups that had been previously reported were identified by uORF-Finder [1,4]. The other 16 conserved uORFs have not been reported previously. Homologs of these 19 conserved uORF groups also exist in other angiosperm species (Figures 1, 2, 3, 4, 5).
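As a rough illustration of the uORF-detection step, the following Python sketch enumerates candidate uORFs that lie entirely within a 5' UTR. It is not the uORF-Finder Perl program: the function name `find_uorfs`, the restriction to uORFs that terminate before the mORF start codon, and the toy transcript are all assumptions made for illustration. The default minimum length of 20 amino acids follows the remark in the Discussion that shorter uORFs would not have been detected.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(cdna, morf_start, min_aa=20):
    """List ORFs that start with ATG and end at a stop codon inside the 5' UTR.

    cdna:       transcript sequence in DNA letters (hypothetical input format)
    morf_start: 0-based index of the first base of the mORF start codon
    min_aa:     minimum number of codons before the stop (ATG included)

    Returns tuples (start, end, n_codons), where end is the index just past
    the stop codon. Simplification: uORFs that overlap the mORF are ignored.
    """
    seq = cdna.upper()
    uorfs = []
    # Every ATG that fits completely upstream of the mORF start codon.
    for start in range(0, morf_start - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon until the first in-frame stop within the UTR.
        for pos in range(start + 3, morf_start - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons >= min_aa:
                    uorfs.append((start, pos + 3, n_codons))
                break
    return uorfs


# Toy transcript (made up): a 3-codon uORF followed by a short mORF.
toy_cdna = "GG" + "ATGAAACCCTAA" + "GGG" + "ATGGCCTGA"
toy_morf_start = 17
print(find_uorfs(toy_cdna, toy_morf_start, min_aa=3))
```

In a comparison like the one described above, the peptides translated from such candidates in one species would then be aligned against those from the paired homolog in the other species; that step is omitted here.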

Comparison of Arabidopsis homologs detects additional conserved uORFs
Conserved uORFs that are not sufficiently well conserved to be detected in a rice-Arabidopsis comparison could conceivably be detected in ohnologs, homologous genes arising by whole-genome duplication (WGD) [30], and in paralogs, homologous genes arising from segmental duplication or tandem duplication. Modification of uORF-Finder allowed comparison of each full-length cDNA to all other cDNAs in the same collection (see Methods), and identified seven additional conserved uORF homology groups (Tables 1 and 4; Figures 6, 7, 8). Six of these pairs are ohnologs, created by the most recent WGD (24-40 Mya) in an ancestor of Arabidopsis [31][32][33]. The seventh pair is not found in syntenic regions and is most likely a paralogous pair. It appears to have arisen at about the same time as the recent WGD event because its synonymous substitution frequency (Ks value) of 0.7 is similar to the median Ks of recent duplicate pairs (0.8) and is within their Ks range (0.4-1.6) [32]. The corresponding rice genes in four of the seven homology groups possess uORFs, but lack sufficient uORF sequence similarity to have been detected in the Arabidopsis-rice comparison (Figures 6, 7, 8).

Purifying selection maintains uORF amino acid sequences
Pairwise Ka/Ks tests for selection on amino acid sequences were applied to each uORF homology group and its associated mORFs to determine whether uORF amino acid sequences are under selective constraints similar to those on their associated mORFs. Both an approximate method (Yn00) and a maximum likelihood method (codeml) were used to calculate mean pairwise Ka/Ks ratios for each group. A Ka/Ks ratio less than 1 implies that negative, or purifying, selection has acted on the sequence, a ratio equal to 1 suggests drift, and a ratio greater than 1 indicates that positive selection has acted on an amino acid sequence. It is also true that conservation at the nucleotide level, not the amino acid level, can drive the Ka/Ks ratio to one. Analysis of all 26 homology groups showed that generally both uORFs and mORFs have been under mild to strong purifying selection since the divergence of each gene pair (Table 5) and these low Ka/Ks ratios suggest that the conservation is at the amino acid level, not simply at the nucleotide level.

Table 1 (partial). Group | Predicted mORF protein | Function | Source
 | Unknown | Systemically primed response to pathogens | [91]
13 | Phosphoethanolamine N-methyltransferase | Phosphocholine biosynthesis | [38]
14 | HDZip class I transcription factor | Transcriptional control; development | [92,93]
15 | bHLH transcription factor | Transcriptional control; responsive to polyamine? | [52,68]
16 | MAP kinase | Signal transduction | PlantsP database [99]
17 | Unknown | Unknown |
18 | Transcription co-activator/repressor HsfB1 | Mediator of heat shock response | [94,95]
19 | SAUR protein | Mediator of auxin response; calmodulin (CaM) binding | IPR003676; [96]
uORF conserved in Arabidopsis paralogs
20 | Unknown | Unknown |
21 | ERF/AP2 transcription factor | Putative regulator of pathogen resistance | [97,98]
26 | RING finger (C3HC4-type zinc finger) | Ubiquitination; mediator of protein degradation | Protein domain analysis*
bZIP, basic leucine zipper; bHLH, basic helix-loop-helix; AdoMetDC, S-Adenosylmethionine decarboxylase; Mic-1, colon cancer-associated protein macrophage-inhibitory cytokine 1; MAP kinase, mitogen activated protein kinase; HDZip, homeodomain leucine zipper; ERF/AP2, ethylene response factor/apetala2. *As determined by InterProScan and NCBI conserved domains search.

Table footnotes: AdoMetDC, S-Adenosylmethionine decarboxylase; PPC, PlantsP protein kinase classification; TPPase, Trehalose-6-phosphate phosphatase. a uORF found upstream of annotated mORF-containing locus (within 2 kb). b As designated by Bailey et al [68], nomenclature agreed upon by both Heim et al [69] and Toledo-Ortiz et al [67]. c At4g25670 and At4g25690 (tandem duplicates) have the same recent retained duplicate (not reported by Blanc and Wolfe). d As designated by the PlantsP database [99]. e Not found in Blanc and Wolfe's initial analysis of ohnologs, but synteny and homology suggest they are retained recent duplicates.

Figure 1. Alignments of plant uORF homology groups 1-4. Plant sequences were aligned using ClustalW v. 1.82 and displayed using Jalview. See main text for abbreviated species names and Genbank accession number, cDNA clone number, or genome identifier.

Figure 3. Alignments of plant uORF homology groups 8-11. Details as in Figure 1. Decimal places in the group number indicate multiple conserved uORFs in a given 5' UTR.

Figure 4. Alignments of plant uORF homology groups 12-15.1. Details as in Figure 1. Decimal places in the group number indicate multiple conserved uORFs in a given 5' UTR.

Figure 6. Alignments of plant uORF homology groups 20 and 21. Details as in Figure 1. Groups with similarity in both the monocot and dicot lineages are shown as separate alignments and as a joint alignment.
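To make the Ka/Ks reasoning in this section concrete, the Python sketch below computes a crude pairwise estimate in the spirit of the Nei-Gojobori counting approach with a Jukes-Cantor correction. It is neither Yn00 nor codeml, and it is not the procedure used in this study. The function names are hypothetical, and the handling of codons that differ at more than one position (they are simply skipped) is a simplifying assumption, as is the requirement for an ungapped, in-frame alignment.

```python
from itertools import product
from math import log

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def syn_sites(codon):
    """Number of synonymous sites in one codon (each position counts 0 to 1)."""
    total = 0.0
    for i in range(3):
        for base in BASES:
            if base == codon[i]:
                continue
            mutant = codon[:i] + base + codon[i + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                total += 1 / 3
    return total


def jukes_cantor(p):
    """Correct a proportion of differing sites for multiple substitutions."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        return float("inf")  # saturated; the correction is undefined
    return -0.75 * log(1 - 4 * p / 3)


def ka_ks(seq1, seq2):
    """Return (Ka, Ks, Ka/Ks) for two aligned, ungapped coding sequences.

    Simplification (assumption): codon pairs with 2 or 3 differences are
    skipped instead of being averaged over mutational pathways.
    The ratio is None when Ks is zero or saturated.
    """
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    s1, s2 = seq1.upper(), seq2.upper()
    syn_total = syn_diff = nonsyn_diff = 0.0
    codons_used = 0
    for i in range(0, len(s1), 3):
        c1, c2 = s1[i:i + 3], s2[i:i + 3]
        n_diff = sum(a != b for a, b in zip(c1, c2))
        if n_diff > 1:
            continue
        codons_used += 1
        syn_total += (syn_sites(c1) + syn_sites(c2)) / 2
        if n_diff == 1:
            if CODON_TABLE[c1] == CODON_TABLE[c2]:
                syn_diff += 1
            else:
                nonsyn_diff += 1
    nonsyn_total = 3 * codons_used - syn_total
    ks = jukes_cantor(syn_diff / syn_total) if syn_total else 0.0
    ka = jukes_cantor(nonsyn_diff / nonsyn_total) if nonsyn_total else 0.0
    ratio = ka / ks if 0 < ks < float("inf") else None
    return ka, ks, ratio
```

For a made-up pair such as `ka_ks("CTGGGTAAAGCC", "CTAGGCAAAGCC")`, both differences are synonymous, so Ka is zero while Ks is positive, which is the pattern expected under strong purifying selection on the peptide.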
One possible explanation for low Ka/Ks ratios in the putative uORFs is incomplete splicing of full-length cDNAs in which the uORF and mORF are normally fused. To address this possibility, all Genbank Arabidopsis ESTs were screened for evidence of uORF-mORF translational fusions. No ORFs were found to run continuously between the uORF and mORF, with one exception. A fusion product (Genbank accession no. DR353698) was identified between the N-terminal and central region of the uORF and the central and C-terminal region of the mORF found at locus At5g03190 (group 17). This putative uORF is nevertheless retained in Table 1 for two reasons. Firstly, the four uORF C-terminal amino acids that are excluded in the fusion EST are perfectly conserved in monocot and dicot members, and the position of their stop codon is perfectly conserved; it is therefore difficult to explain this conservation if the uORF is not translated. Secondly, the N-terminal portion of the mORF that is removed in the fusion EST is similar between three Arabidopsis loci of the same homology group, with the start codon position also being conserved in these three members. It is likely, therefore, that the fusion EST represents an alternatively spliced form of this transcript, but further characterization of this locus will be needed to support this conclusion. Most of the homology groups show uORFs with conserved amino acid residues at the C-terminus and an identical positioning of the uORF stop codon (Figures 1, 2, 3, 4, 5, 6, 7, 8). This would suggest that the full-length cDNAs are fully spliced and are not erroneously predicting uORF sequences due to incomplete splicing.

Figure 7. Alignments of plant uORF homology groups 22 and 23. Details as in Figure 6.

Conserved features of uORF sequences
The lengths of uORFs vary to differing degrees within and among homology groups, but in amino acid sequence alignments nearly all groups exhibit considerable conservation of the position of the N-terminus and/or the C-terminus, i.e., length variation is usually due to a variable region in the middle or at one end of the uORF (Table 6; Figures 1, 2, 3, 4, 5, 6, 7, 8). The amino acid sequences of some uORFs possess potentially interesting features. Notably, some uORF groups possess regions rich in serine, threonine, and/or tyrosine, and others possess regions rich in lysine and/or arginine. Two homology groups are particularly noteworthy: Group 8 uORFs specify peptides with a coiled coil-helix, coiled coil-helix (CHCH) domain (Pfam accession number PF06747; Figure 9), and group 13 uORFs encode peptides that are extremely serine/arginine-rich (Figure 10). Both of these unusual peptides will be discussed in further detail below.

Most genes with conserved uORFs appear to have regulatory functions
A total of 31% of mORFs at conserved peptide uORF loci in Arabidopsis were predicted to encode transcription factors, as determined by GO molecular function terms (Tables 2 and 4), whereas only 5.9% of all Arabidopsis loci are predicted to encode transcription factors [34]. Thus, genes predicted to encode transcription factors are significantly overrepresented (p = 1.2 × 10⁻⁷) among conserved peptide uORF loci. In each case, GO terms were validated by manual annotation of protein functions using domain predictions from NCBI Conserved Domain and InterProScan Database searches [35,36]. A variety of different types of transcription factors, including bZIP, Ethylene Response Factor/Apetala 2-like (ERF/AP2-like), basic helix-loop-helix (bHLH), and homeobox proteins, are represented among conserved peptide uORF loci with no demonstrable bias. No other GO terms were found to be significantly over- or under-represented in the uORF data set.

Table footnotes: a Blanc and Wolfe (2004) report that At2g3790 and At3g53670 are retained recent duplicates, but the At2g3790 locus has since been replaced by At2g3780. b As defined by Nakano, et al [97] and previously characterized as part of subfamily B-6 by Sakuma, et al [100]. c uORF found upstream of annotated mORF-containing locus (within 2 kb).
Biological functions could be inferred for 16 of the 26 uORF homology groups (Table 1). Six groups encode transcription factor homologs and so are presumably involved in transcriptional control (groups 1, 2, 14, 15, 18, and 21). Five groups are likely to be involved in signal transduction, including four protein kinases and a putative calmodulin-binding protein involved in auxin response (groups 10, 16, 19, 23, and 25). Two groups are involved in the metabolism of small molecules that regulate plant development: polyamines (group 3) [1] and trehalose (group 11) [37]. One group (13) encodes the key enzyme in the biosynthesis of phosphocholine, which is an intermediate in biosynthesis of phosphatidylcholine and phosphatidic acid; phosphocholine levels influence levels of phosphatidic acid, an important physiological and developmental signal molecule [38][39][40]. Group 7 putatively encodes translation initiation factor eIF5, which influences start codon selection, and Group 26 encodes a RING finger protein, suggesting a role in targeted protein turnover by ubiquitination. Of the remaining 10 groups, 8 encode predicted proteins of unknown function, 1 encodes an ankyrin-repeat protein, and 1 encodes an amine oxidase. Thus, all but two families of conserved uORF genes whose functions are known or can be inferred potentially play a regulatory role in the biology of plants.

Genes with conserved uORFs were preferentially retained after whole genome duplication
Since the most recent WGD event in the Arabidopsis lineage, only 14% of the original gene pairs present in the ancestral tetraploid have been retained as a duplicate pair in the extant Arabidopsis genome, i.e., for the remaining 86% of ancestral gene pairs, one member has been lost [32]. Among 31 ancestral gene pairs that possessed conserved uORFs at the time immediately following the genome duplication, 12 (39%) pairs have been retained in the present Arabidopsis genome (Table 2), which is significantly higher than the genome-wide average (p = 0.0005). The conserved uORF was retained in both copies of each of the twelve retained duplicate pairs. Retention of these 12 uORFs in both paralogs suggests that they act in cis, consistent with the expectation that uORFs typically control translation of downstream mORFs on the same RNA molecule [41].
The overrepresentation of transcription factors among conserved uORF loci could be due, in part, to preferential retention of recent transcription factor duplicates (22.7% retention of transcription factor duplicates vs 14.4% retention genome-wide) [32], but this alone does not account for the high frequency of predicted transcription factors among the uORF loci. When duplicate history bias is removed by calculating GO term frequencies of the pre-genome-duplication set of loci, transcription factors are still overrepresented (11/31 loci, or 35%).
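The enrichment claims in this section can be checked in outline with a one-sided exact binomial test, sketched below in Python. The text here does not say which statistical test produced the reported p-values, so the choice of test is an assumption and the numbers it yields need not match those reported. The count of 18 transcription factor loci is likewise an assumption, obtained by rounding 31% of the 58 Arabidopsis loci.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least k
    'successes' among n loci if each succeeds independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Retention after WGD: 12 of 31 ancestral uORF gene pairs retained,
# against a genome-wide retention frequency of 14%.
p_retention = binom_upper_tail(12, 31, 0.14)

# Transcription factors among pre-duplication uORF loci: 11 of 31,
# against a genome-wide transcription factor frequency of 5.9%.
p_tf_ancestral = binom_upper_tail(11, 31, 0.059)

# Transcription factors among extant uORF loci: 18 of 58 is an assumed count
# (31% of 58, rounded), against the same 5.9% background.
p_tf_extant = binom_upper_tail(18, 58, 0.059)

for label, value in [
    ("retention", p_retention),
    ("TF, ancestral loci", p_tf_ancestral),
    ("TF, extant loci", p_tf_extant),
]:
    print(f"{label}: p = {value:.2e}")
```

A hypergeometric or chi-square test against the actual genome-wide counts would be an equally reasonable choice; the binomial form is used here only because the text supplies background frequencies rather than counts.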

Conserved angiosperm uORF peptide sequences in primitive plants and other eukaryotes
To determine whether any of the 19 uORF homology groups conserved between rice and Arabidopsis might also be present in other eukaryotes, we searched for uORF sequences in all Genbank eukaryotic ESTs. Amino acid sequences similar to four homology groups (3, 8, 13, and 15) were detected in non-angiosperms. Group 15 was found only as distantly as a fern (Adiantum); group 3 was found as far from angiosperms as the green algae (Ulva); group 13 was found in an animal (Xenopus tropicalis); and group 8 uORF sequence was found in primitive plants, animals, fungi, and a slime mold (Figures 9 and 10). Another algal sequence (Chlamydomonas) from the Genbank non-redundant database was identified as belonging to group 3 (Genbank: AJ841703). The group 13 uORF homolog found in an X. tropicalis EST was also found in a genomic contig sequence [42] in which the uORF homolog is flanked by genes that are more similar to animal sequences than to any known plant sequences. Thus, this group 13 uORF homolog most likely exists in the Xenopus genome rather than being an EST library contaminant.
Sequences similar to group 8 Arabidopsis and rice uORFs were found in most eukaryotes, but transcript sequence following the uORF varied among the different lineages. All land plant uORFs were associated with macrophage inhibitory cytokine-1-like (Mic1-like) mORF sequences while the mORFs downstream of the group 8 uORF homologs in nematodes and arthropods code for an unknown protein and a putative mannosyl transferase, respectively (Figure 11). Available EST sequences for each of the group 8 uORF homologs in mammals, fungi, algae, and slime mold end shortly after the conserved peptide uORF, suggesting that in these eukaryotes the uORF homolog is not associated with a mORF and is simply a short ORF. This is further supported by more than 10 human ESTs that end at the same position and include a polyA sequence. In the sea squirt lineage a putative mORF is present in the EST sequences, but a full-length cDNA sequence will be needed to further investigate this possibility.
Although there is variability in the sequences found downstream of group 8 uORFs, three features of these uORF homologs are relatively well conserved: the length of the predicted uORF, the relative positions of four cysteine codons, and the positions of two introns (Figure 9). The length of the uORF peptide ranges from 51 amino acids in Haemonchus (nematode), to 74 amino acids in humans, and length is even more highly conserved within each of the land plant, arthropod, nematode, fungal, and vertebrate lineages (59-62, 65-69, 51-68, 54-66, and 69-74 amino acids, respectively). Four cysteine residues consistently align in all eukaryotes, with nine amino acids separating the first and second cysteine residues, as well as the third and fourth cysteine residues, whereas 11-15 residues separate the second and third cysteines. Two intron positions are perfectly conserved among the land plants, vertebrates, and at least one member of the fungal lineage. The first intron lies between the third and fourth amino acids following the first conserved cysteine position, and the second intron lies between the fourth and fifth amino acids following the fourth conserved cysteine position (Figure 11). The first and/or second intron positions are present in Dictyostelium, algae, and some fungi, but are absent in nematodes, arthropods, and sea squirts.
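The cysteine spacing described above (nine residues between the first and second cysteines, 11-15 between the second and third, nine between the third and fourth) can be written as a simple pattern. The Python sketch below is only an illustration of that spacing rule: the pattern name, the function, and the toy peptide are hypothetical, and a match does not by itself establish membership in homology group 8.

```python
import re

# C-x9-C-x(11,15)-C-x9-C, where x is any residue.
CYS_SPACING = re.compile(r"C(.{9})C(.{11,15})C(.{9})C")


def cysteine_gaps(peptide):
    """Return the three gap lengths between the four patterned cysteines,
    or None if the peptide does not contain the spacing pattern."""
    match = CYS_SPACING.search(peptide.upper())
    if match is None:
        return None
    return tuple(len(gap) for gap in match.groups())


# Made-up peptide with the expected spacing (not a real uORF sequence).
toy_peptide = "MS" + "C" + "A" * 9 + "C" + "G" * 12 + "C" + "L" * 9 + "C" + "KK"
print(cysteine_gaps(toy_peptide))
```

Because the spacers are matched with `.`, additional cysteines inside a spacer are tolerated; a stricter pattern would use `[^C]` if that turned out to be appropriate for the real alignments.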
The four cysteines are part of a putative coiled coil-helix, coiled coil-helix (CHCH) domain (Pfam accession number PF06747), also found in three small yeast proteins, Cox17p, Cox19p, and Mrp10p. Cox17p and Cox19p are required for assembly of functional cytochrome oxidase and Mrp10p is homologous to a nuclear-encoded mitochondrial ribosomal protein. A hypothetical human gene, CHCH domain 7 (CHCHD7), is also similar to the group 8 uORF, as determined by BLAST similarity searches.

Figure 9. Group 8 small ORF/uORF alignment and percent identity across various eukaryotes. Representative eukaryotic species aligned using Muscle and displayed by percent identity using Jalview. Arrowheads represent two conserved intron positions for all but Mesvi (no genomic support), Dicdi (first but not second intron present), Ciosa (no introns), Caeel (no introns), Drome (no introns), and Neucr (first but not second intron present based on predicted mRNA). See main text for abbreviated species names and Genbank accession number, cDNA clone number, or genome identifier.

Phylogenetic relationships among group 8-like ORFs
Fungal, animal, and plant representatives of each CHCH-containing ORF were identified using a BLAST search, and their evolutionary relationships were inferred using a Bayesian phylogenetic analysis (Figure 12; Additional file 1). Animal Mrp10p-like (Genbank: BC075310, DR155443 and BX935835), Debaryomyces group 8-like (Genbank: NC_006045), and Dictyostelium Cox19p-like (Genbank: XM_631387) sequences were more divergent than other sequences, causing long branch attraction [43]. Thus, these sequences were removed from the analysis to prevent tree topology distortion. Five distinct clades were observed, which we refer to as Cox17p-like, Cox19p-like, Mrp10p-like, CHCHD7-like, and uORF group 8-like. All clades but one (Mrp10p-like) contain representatives from fungi, animals, and plants and are strongly supported, showing branch order probabilities greater than 0.8, which suggests that these sequences emerged in a common eukaryotic ancestor and have since diverged in the three lineages. Mrp10p-like sequences do not strongly group independently of other branches (P = 0.57), which could be due to highly divergent amino acid sequence represented by relatively long branches. The tree shows that the group 8-like proteins are a distinct clade from other CHCH domain proteins (P = 1.0), and that CHCHD7-like proteins are more closely related to group 8-like members than to other CHCH-containing proteins (P = 0.94). The tree topology also indicates that Cox17p-like and Cox19p-like genes are more closely related to each other than to other CHCH proteins (P = 0.97).
A separate phylogenetic analysis of the 46 group 8-like sequences shows that most cluster into five taxonomic groups (plants and green algae, arthropods, nematodes, vertebrates, and fungi) with strong branch support (0.85-1.00) in all but the fungal lineage (0.58; Figure 13). Sea squirt sequences group with one of two Branchiostoma sequences with weak branch support (0.53). Dictyostelium, sea urchin (Strongylocentrotus), and one further Branchiostoma sequence do not group with any of these clades, and their placement is weakly supported (0.53). Sea squirt, Branchiostoma, and sea urchin sequences should be more similar to other deuterostomes (which include the vertebrate lineage) than to other organisms, but the short group 8-like sequence alignment could prevent resolution of correct evolutionary relationships of some groups (Additional file 2). Despite weakly supported branches, there is strong support for independent clustering of the arthropods, nematodes, vertebrates and plants, as expected.
Although two Branchiostoma group 8-like sequences (Brafl1 and 2) suggest that there has been a duplication event within this lineage, there is no evidence for maintenance of ancient group 8-like gene duplications occurring within the plant, vertebrate, nematode, arthropod, or fungal lineages. In Arabidopsis both the recent and ancient duplicates from two WGD events have been lost from the genome. Only the Mesostigma genome contains two group 8-like transcripts. Their short branch lengths indicate that this duplication occurred relatively recently and it is possible that insufficient time has passed for loss of the second copy.

Figure 11. Diagrammatic representation of Group 8 features among eukaryotes. Light grey boxes represent small ORFs/uORFs, four perfectly conserved cysteine residues are shown as 'C', and numbers within triangles represent the number of amino acids between the immediately preceding cysteine and an intron. Brackets surrounding fungal introns represent the variable nature of the intron position and/or presence. White boxes show mORFs directly downstream of the uORFs in a given lineage. Presence of a polyA tail is likely to occur in vertebrates (pA; see Results). Question marks indicate mORFs could be present, but insufficient EST sequence is available to infer this feature reliably.

Discussion and conclusion
Comparative analysis by uORF-Finder of 5' UTRs in full-length cDNAs from two distantly related plant species, rice and Arabidopsis, identified conserved peptide uORFs in 58 Arabidopsis loci that comprised 26 uORF homology groups and in 36 rice loci that comprised 19 homology groups, increasing the number of known conserved uORF homology groups from two to 26 and providing useful, new information for investigations of regulatory biology. Because full-length cDNAs derived from both Arabidopsis and rice only represent a fraction of all nuclear genes, not all conserved uORFs are expected to be detected by this approach. Extrapolation to the whole Arabidopsis genome suggests that it possesses approximately 61 to 102 genes with conserved peptide uORFs that are also conserved in the rice genome (see Methods for calculation). An additional 24 conserved peptide uORF genes are predicted among Arabidopsis loci with retained duplicates from the most recent WGD event. In all, there are likely to be approximately 99-140 genes, or 0.38-0.53% of all protein-coding genes, with conserved peptide uORFs in the Arabidopsis genome. Because short conserved uORFs (<20 amino acids) would not have been detected by uORF-Finder, this is a conservative estimate.

Figure 12. Phylogenetic tree depicting CHCH domain-containing genes and alignment. Unrooted phylogenetic tree generated using MrBayes 3.0. See main text for abbreviated species names and Genbank accession number, cDNA clone number, or genome identifier.
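The passage above does not spell out how the 61-102 extrapolated genes and the 24 additional predicted genes combine into the 99-140 total. One reading that reproduces the stated range is sketched below in Python: it adds the 14 Arabidopsis loci that were found only in the within-Arabidopsis comparison (58 loci in total minus the 44 found in the rice comparison). That decomposition is an assumption made for illustration, and the variable names are hypothetical; the Methods calculation may differ.

```python
# Figures taken from the text.
extrapolated_rice_conserved = (61, 102)   # genome-wide estimate, low and high
predicted_additional_duplicates = 24      # predicted among retained duplicates
arabidopsis_loci_total = 58               # all Arabidopsis loci with conserved uORFs
arabidopsis_loci_vs_rice = 44             # loci found in the rice comparison

# Assumption: loci observed only in the Arabidopsis-Arabidopsis comparison
# are counted on top of the rice-conserved extrapolation.
observed_paralog_only = arabidopsis_loci_total - arabidopsis_loci_vs_rice

total_low, total_high = (
    bound + observed_paralog_only + predicted_additional_duplicates
    for bound in extrapolated_rice_conserved
)
print(total_low, total_high)

# Protein-coding gene counts implied by the stated percentages (0.38-0.53%).
implied_genes_low = total_low / 0.0038
implied_genes_high = total_high / 0.0053
print(round(implied_genes_low), round(implied_genes_high))
```

Under this reading both ends of the range are consistent with a single protein-coding gene count of roughly 26,000, which is why the two percentages are quoted against the same denominator.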
To find additional conserved uORFs, more extensive collections of full-length cDNA sequences will need to be developed and/or 5' UTRs predicted from genomic sequence will be required. As full-length cDNA sequence resources become available for other plant species, such as maize [44] and poplar [45], it should be possible to identify additional conserved uORFs that might be specific to taxonomic groups, such as monocotyledons or dicotyledons. Similarly, analysis of ancient tetraploidy events in species such as poplar and maize might be able to identify uORFs conserved between retained duplicates.
Figure 13. Phylogenetic tree depicting group 8 small ORFs/uORFs and alignment. Unrooted phylogenetic tree generated using MrBayes 3.0. See main text for abbreviated species names and Genbank accession number, cDNA clone number, or genome identifier.

Conserved uORF genes are regulatory genes
Based on the study of a few hundred genes, it has been suggested that uORFs are usually associated with mORFs that encode proteins that regulate cell growth [41,46], but a genome-wide study of upstream AUGs (uAUGs) found no correlation of uAUG-containing transcripts with any particular gene ontology (GO) molecular function term in mammalian transcripts [6]. These observations did not differentiate between sequence-dependent and sequence-independent uORFs. Our analysis shows that genes encoding transcription factors are overrepresented among genes predicted to encode conserved peptide uORFs, representing almost one third of the 58 Arabidopsis loci as compared to 6% of all genes. Moreover, nearly all genes whose function can be reasonably inferred appear to play some regulatory role in the biology of plants.

Do conserved peptide uORFs mediate feedback translational regulation by small regulatory molecules?
Certain eukaryotic conserved peptide uORFs are known to control translation of a downstream mORF in response to a metabolic product such as arginine or polyamines [4,14,47]. In the case of the fungal arginine-regulated carbamoyl-phosphate synthase subunit, a uORF codes for the arginine attenuator peptide that responds to increased arginine concentrations by causing ribosomes to stall near the 3' end of the uORF, interfering with ribosome scanning and translation of the downstream mORF [14]. A similar mechanism has been elucidated for the regulation of AdoMetDC in which the uORF peptide interferes with the termination of uORF translation in a polyamine-dependent manner [48,49]. In plants, sucrose is a signaling molecule that controls not only the transcription of many genes, but also translation of a class of bZIP transcription factors via their conserved uORF, suggesting the possibility of sucrose interaction with a uORF-encoded peptide to regulate translation downstream [4].
Our analysis identified not only these previously known examples of genes involved in pathways exhibiting small molecule feedback in a uORF sequence-dependent manner, but several additional genes that might also act via this mechanism. One is the conserved group 13 uORFs, which are present in genes that encode phosphoethanolamine N-methyltransferase (PEAMT/NMT), the key enzyme in phosphocholine (PCho) biosynthesis. Recently, NMT1 has been shown to contain a uORF that differentially affects translation of the mORF in response to exogenously added choline [50]. This effect is observed when the uORF start codon is abolished but it remains to be determined whether the response to choline is uORF sequence-specific. Intriguingly, the group 13 uORF peptide is rich in arginine and serine (40-48% in Arabidopsis and rice genes; Table 6). A variety of arginine-rich peptides 15-20 amino acids long with 5 or more arginines bind to specific RNA sequences [51]. The predicted group 13 uORF peptide has 5-7 arginines in a 16-17 amino acid region, well within this range, suggesting the possibility that it might bind to a specific RNA sequence, perhaps in PEAMT/NMT transcripts. The fact that the group 13 uORF peptide was also found in Xenopus suggests that its regulatory role is widespread in eukaryotes.
Another example is homology group 11, whose mORFs are predicted to encode trehalose-6-phosphate phosphatase (TPPase); trehalose-6-phosphate is postulated to regulate sugar metabolism in plants [37]. In summary, sucrose, polyamines, phosphatidic acid, and trehalose-6-phosphate are possible regulators of translation of downstream mORFs through interaction with conserved uORFs. Also interesting in this light are group 19, which specifies an auxin-induced calmodulin-binding homolog, and group 15, which encodes a bHLH transcription factor that is believed to be subject to translational control through its conserved uORF by spermine synthase [52]. Spermine is a polyamine signal molecule necessary for normal plant growth and defense responses.
As mentioned, six conserved uORF families specify transcription factors, one of which is regulated by the small signaling molecule sucrose. In plants, transcription factors often act quantitatively to control target gene expression proportionate to transcription factor concentration [53]. Therefore, it is interesting to consider the possibility that translational control of transcription factor protein levels could be mediated by interaction of a conserved uORF peptide with a metabolite. This might be an effective means for quantitatively modulating the levels of expression of a pathway or network of downstream genes, for instance, in response to changing physiological or environmental conditions. This logic can equally be applied to other key control proteins and their uORFs.

How is translational control mediated by conserved peptide uORFs?
If conserved uORF peptides can regulate mORF levels in response to small molecules, they are clearly analogous to RNA sensors and riboswitches that sense small molecules and regulate transcript translation accordingly [28,54]. It is interesting to think of conserved peptide uORFs too as sensors of cellular, physiological, or developmental conditions. Although the role of conserved uORFs as 'sensors' of cellular metabolites has been clearly established in the cases of polyamine, sucrose, and arginine concentration, it is still not clear how uORF peptides gauge cellular conditions. uORF peptides could affect mORF translation by interacting directly with the ribosomal complex, by associating with other proteins that influence the translational machinery, and/or by stabilizing or destabilizing RNA secondary structures in the 5' UTR that impede or promote mORF translation. Given the variety of uORF peptides represented in the 26 homology groups, each of these possibilities could occur one or more times.
It is perhaps interesting to note also that the uORFs of 9 homology groups are rich in serine, threonine, and/or tyrosine. These amino acids are potential targets for phosphorylation that conceivably could promote or inhibit ribosome stalling or initiation at downstream mORFs. As mentioned above, lysine/arginine-rich motifs could function in RNA binding [51].

Effect of nonsense-mediated decay on uORF transcripts
Because uORFs create a premature termination codon (PTC), the nonsense-mediated decay (NMD) system might target uORF transcripts for degradation. Yoine et al [55] carried out a microarray analysis of plants mutant in the UPF1 ortholog, which is required for NMD.
Among 75 genes that Yoine et al identified that accumulate transcripts at more than twice the level in the upf1 mutant as in wild type Arabidopsis, we found representatives of seven uORF homology groups (1, 7, 10, 12, 13, 15, and 17), suggesting that these uORF transcripts are susceptible to nonsense-mediated decay. The uORFs in these groups might work in a manner analogous to the uORF arginine attenuator protein (AAP) in the fungal CPA1 transcript. The CPA1 transcript exclusively exhibits increased levels of degradation via NMD when the AAP inhibits translation termination in response to high levels of arginine, ultimately decreasing translation using a two-pronged approach [56]. Similarly, the above-identified plant uORFs could intensify translational inhibition of their associated mORFs by both blocking the ribosome physically and inducing the NMD pathway.

Evolutionary emergence of uORFs and a 'transcriptional fusion' model
Very little is known about how uORFs arise. In the extant rice and Arabidopsis genomes, sequences homologous to uORFs identified by uORF-Finder were observed only in 5' UTRs and never as part of another mORF, within 3' UTRs, within introns, or in non-transcribed regions. Possible origins of 5' UTR ORFs include (a) fragmentation of mORF sequences, (b) creation of an AUG or alternate start codon by random mutation within the 5' UTR and subsequent selection for the peptide sequence, and (c) relocation of other ORF sequences within the genome to the 5' UTR or upstream region of a given gene and subsequent transcriptional fusion of the two ORFs.
Transcriptional fusions occur in an estimated 2% of adjacently transcribed mRNAs in the human genome [57]. The evolutionary history of uORF homology group 8 suggests a stable transcriptional fusion model leading to uORF emergence in plants, arthropods and nematodes. Group 8 uORFs are associated with three independent mORFs in the land plant, arthropod and nematode lineages, while the vertebrate, slime mold, algal, and fungal small ORFs that are orthologous to group 8 uORFs do not seem to be associated with mORFs. Given the phylogenetic relationships among these species [58], the most parsimonious explanation for the evolutionary origin of group 8 uORFs is that they originated as a small ORF transcribed independently of a mORF. Subsequently, this small ORF gene was displaced via genome rearrangements or transposition events to regions upstream of three independent large ORFs resulting in transcriptional fusions of the two previously independent transcripts. The uORFs and mORFs in the plant, nematode, and arthropod lineages have remained associated within the same transcript for 300-500 My, therefore these transcriptional fusion events seem to be stable and perhaps biologically advantageous. Evidence for other uORF emergence models, such as mORF fragmentation or de novo creation, will require further analysis of closely related organisms.

Potential dual role for uORF proteins
uORFs can regulate specific mORF protein expression in trans when the cis uORF is intact [59,60] but it is still unclear whether uORF proteins can play additional roles in the cell. Small proteins, similar in length to uORFs, play a role in plant development and could also be involved in plant defense [61,62]. Potentially, uORFs could affect such processes independently of their role as a translational regulator. Homology group 8 uORFs are largely conserved in length, sequence, and intron position across most eukaryotes, but in fungi, algae, slime mold, and vertebrates, the associated mORF seems to be absent. The absence of the mORF and strong conservation of the uORF amino acid sequence over one billion years in these eukaryotes indicates that, in plants, this protein could act as both a regulator of mORF expression and as a trans-acting factor in the cell.
Group 13 uORFs contain peptides similar to RS motifs found in SR proteins. SR proteins are a family of proteins required for alternative and constitutive pre-mRNA splicing [63,64]. A subset of these proteins, shuttling SR proteins, have not only been implicated in splicing but have also been shown to stimulate translation of a reporter gene when fused to the same transcript [65], analogous to a uORF-mORF associated pair. It is possible then, that group 13 uORF proteins could also play a dual role, as a translational regulator and trans factor.
Similarly, some uORFs in mammalian genomes might adopt these dual roles and further characterization of conserved mammalian uORFs [66] could resolve a dual role model.

Applications
Ka/Ks analyses suggest that conserved peptide uORFs are under mild to strong negative selection and might therefore be useful for resolving orthology and paralogy of specific gene pairs. For example, phylogenetic studies have sometimes failed to identify all members within a uORF homology group when only considering the mORF sequence (e.g. homology group 2). Although the bHLH transcription factor domain occurs in the mORF of all three group 2 members, none were identified in the original studies, and only two of the three members have been included in the latest description of Arabidopsis bHLH family members [67][68][69].
Further characterization of conserved peptide uORFs and their functional mechanisms might also provide useful tools for creating inducible or repressible expression vectors in plants. AdoMetDC1, bZIP11, and PEAMT/NMT1 protein levels are regulated by conserved uORFs in a metabolite-dependent manner (polyamine, sucrose, and choline, respectively), and other conserved uORFs, such as those associated with TPPases, might also regulate mORF translation in response to cellular compounds. If this is the case, further functional characterization of conserved peptide uORFs could provide the tools necessary to build constructs that are quickly inducible or repressible at the translational level under various conditions.

Identifying conserved uORFs in rice and Arabidopsis
Corrected RIKEN and Genoscope Arabidopsis thaliana ecotype Columbia and NIAS, FAIS and RIKEN Oryza sativa spp. japonica cv Nipponbare full-length cDNA collections were used for all analysis [70]. A cDNA's major ORF (mORF) was defined as the longest ORF starting with an AUG, the sequence upstream of this AUG was designated the 5' UTR, and upstream ORFs (uORFs) were any ORFs found in the 5' UTR starting with an AUG. All ORFs were identified using getorf [71]. Arabidopsis mORFs were aligned to rice cDNAs using tBLASTn with an E-value cutoff = 1e-5 [72,73] to find putative homologs. Rice cDNAs with hits below this threshold were paired with their respective Arabidopsis transcript, 5' UTR sequences extracted from both, uORFs determined using getorf, and all combinations of rice and Arabidopsis uORF peptide pairs aligned using needle [71]. The reciprocal analysis was also performed, starting with rice full-length cDNA sequences and comparing them to Arabidopsis transcript sequences. All uORFs greater than 100 amino acids were excluded from this analysis.
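The ORF definitions above can be illustrated with a minimal Python sketch. It is not the getorf-based pipeline used in this study: it scans only the forward strand, keeps only ORFs closed by a stop codon, and starts each ORF at the first AUG following a stop in the same frame. The function names `find_orfs` and `split_transcript`, and the example sequence, are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
MAX_UORF_CODONS = 100  # uORFs longer than 100 amino acids were excluded

def find_orfs(seq):
    """Return (start, stop_start) for ATG..stop ORFs in the three forward frames."""
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first AUG after the previous stop
            elif start is not None and codon in STOPS:
                orfs.append((start, i))        # i is the first base of the stop codon
                start = None
    return orfs

def split_transcript(seq):
    """Return (mORF coordinates, 5' UTR sequence, uORF coordinates) for one cDNA."""
    orfs = find_orfs(seq)
    if not orfs:
        return None, seq, []
    morf = max(orfs, key=lambda o: o[1] - o[0])   # longest AUG-initiated ORF
    utr5 = seq[:morf[0]]                          # everything upstream of the mORF AUG
    uorfs = [o for o in find_orfs(utr5)
             if (o[1] - o[0]) // 3 <= MAX_UORF_CODONS]
    return morf, utr5, uorfs

# Toy cDNA: a 2-codon uORF followed by an 11-codon mORF in a different frame.
cdna = "GG" + "ATGGCTTGA" + "CC" + "ATG" + "GCT" * 10 + "TAA"
morf, utr5, uorfs = split_transcript(cdna)
```

Because the 5' UTR is rescanned on its own, uORFs whose stop codon lies downstream of the mORF AUG are not reported by this sketch.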
All pairs with scores >50 were kept and examined manually against existing Arabidopsis transcript annotations (TAIR and TIGR) and existing ESTs to determine whether aligned peptides fall within a probable 5' UTR. To validate the putative uORFs, the first 100 amino acids of the Arabidopsis mORF were aligned to Genbank plant ESTs using tBLASTn (E-value = 1e-10, limit: Viridiplantae [orgn] NOT Arabidopsis [orgn], complexity filter off), and all retrieved plant uORF sequences were aligned to rice and Arabidopsis uORFs using ClustalW [74], manually adjusted, and visualized using Jalview [75] (Figures 1, 2, 3, 4, 5, 6, 7, 8). There were two exceptions to this procedure. Because the uORFs in group 10 are 400-600 bp upstream of the mORF AUG, only the first 25 mORF amino acids were used to search Genbank plant ESTs (first 25 amino acids are very highly conserved). Secondly, high identity was limited to the 3' end of mORFs in group 17, therefore the Arabidopsis transcript's terminal 50 amino acids were aligned to Genbank non-EST plant sequences. Support for a conserved uORF was found in the Medicago truncatula and Lotus corniculatus genomic sequences.
To test whether uORFs appear upstream of non-homologous genes, Arabidopsis uORF sequences were aligned to the entire Arabidopsis genome (version 5) [76] using tBLASTn (E-value = 10). Predicted conserved uORFs were found to lie upstream of the annotated gene instead of in the annotated 5' UTR in approximately 10% of Arabidopsis and 25% of rice genes (Tables 2, 3, 4). The discrepancies with the accepted annotations, found at TAIR [76] and TIGR [77], respectively, demonstrate the benefit of using full-length cDNA sequences for this analysis.
To determine whether sequences similar to these conserved uORFs reside elsewhere in the rice and Arabidopsis genomes, uORF amino acid sequences were aligned with sequences translated from the genome sequence using tBLASTn [73]. Sequences similar to these uORFs were found within 5' UTRs of homologous mORF loci, and were absent from non-homologous transcripts, intronic regions, and intergenic regions with only one exception, Arabidopsis NMT3 (AGI locus identifier At1g73600). The annotated mORF for NMT3 [78] is not covered by any available full-length cDNA and has no EST support at its 5' end. Thus, we annotated NMT3 by comparison with its paralog, NMT1 (At3g18000) [33]. NMT3 possesses sequences similar to the NMT1 uORF, as well as sequences similar to the NMT1 mORF, but the TAIR annotation fuses these into a single ORF. However, NMT3 possesses potential splice sites that would produce transcripts with uORF and mORF sequences similar to those in NMT1. The NMT3 uORF predicted by one alternative splice model is the same length as, and is 72% identical to, the NMT1 uORF amino acid sequence (Group 13 in Figure 4).
The TAIR website was used to assign locus numbers for each Arabidopsis transcript and the TIGR website for rice locus numbers. The Arabidopsis locus numbers were then used to search for retained duplicates from the recent and ancient whole genome duplications as defined on the Arabidopsis Paralogon website [33].

Calculating Ka/Ks
For homology groups 1-19, Ka/Ks values for homologous rice and Arabidopsis mORFs and uORFs were determined using pairwise_kaks.PLS (version 1.7) [79]. Both the approximate method (option -kaks yn00) and the maximum likelihood method (-kaks codeml) were used. Any Ka/Ks values resulting from a Ka or Ks value >10 were excluded from the analysis, as these values result in inaccurate predictions of Ka/Ks [80,81]. The Ka/Ks values for homology groups 20-26 were determined with the same approach using Arabidopsis sequences only.
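The exclusion rule for saturated estimates can be written as a short filter. The sketch below is illustrative only and does not reproduce pairwise_kaks.PLS; the function name `filter_kaks`, the record layout, and the example values are hypothetical, and skipping pairs with Ks = 0 is an added assumption that the text does not state.

```python
def filter_kaks(pairs, max_rate=10.0):
    """Return {name: Ka/Ks} for (name, ka, ks) records that pass the cutoff."""
    ratios = {}
    for name, ka, ks in pairs:
        if ka > max_rate or ks > max_rate:
            continue          # Ka or Ks > 10: excluded, as described in the text
        if ks == 0:
            continue          # assumption: ratio undefined, so the pair is skipped
        ratios[name] = ka / ks
    return ratios

# Invented values for three hypothetical sequence pairs.
example_pairs = [("pair_a", 0.2, 1.0), ("pair_b", 0.5, 12.0), ("pair_c", 0.1, 0.0)]
kept = filter_kaks(example_pairs)
```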

GO molecular function terms
GO molecular function terms [82] were retrieved from TAIR Locus History pages [76]. GO terms for all Arabidopsis loci were downloaded from the TAIR website and used to compare genome-wide GO molecular function term frequencies to those found in the conserved uORF-containing loci. Statistically significant differences were detected using the Exact Binomial test as described in the R program package [83]. This analysis was also carried out by GeneMerge, a program that incorporates a Bonferroni corrected P-value [84].
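As an illustration of the kind of test involved, the sketch below computes a one-sided exact binomial tail probability using only the Python standard library. It is not the R procedure used in this study, and it applies no multiple-testing correction. The count of 18 transcription factor loci is a hypothetical placeholder chosen to approximate "almost one third of the 58 Arabidopsis loci"; the 6% genome-wide frequency is taken from the text.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly over the upper tail."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_loci = 58                 # Arabidopsis loci with conserved peptide uORFs
n_tf = 18                   # hypothetical count, not the value used in the paper
genome_tf_fraction = 0.06   # genome-wide frequency quoted in the text

p_value = binomial_upper_tail(n_tf, n_loci, genome_tf_fraction)
print(f"one-sided p = {p_value:.2e}")
```

A one-sided tail is used here because the question is specifically about overrepresentation; a two-sided test would give a different p-value.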

Identification of Arabidopsis ohnologs and paralogs with conserved uORF
Conserved uORFs were found in Arabidopsis duplicates in much the same way as conserved uORFs were found between rice and Arabidopsis. uORFs and mORFs were defined in the same way, and mORF sequences were aligned to the entire Arabidopsis full-length cDNA collection using BLASTp (E-value cutoff = 1e-5) to detect transcripts deriving from a duplicated locus. mORFs aligning with ≥ 99% identity were discarded, and uORFs of all remaining pairs were aligned using needle and validated as above.
Sequences similar to uORF homology group 8 were aligned, edited, and analyzed in the same manner with one exception, ngen = 3000000.

Estimate of conserved peptide uORF prevalence

Number of Arabidopsis-rice loci
There is an average of 2.23 full-length cDNAs per uORF locus identified (excluding loci identified by BLAST alignment), which suggests that 15200 Arabidopsis genes are represented in the cDNA collections (34000 cDNAs/2.23 cDNAs per locus), representing approximately 60% of all Arabidopsis genes (assuming 26000 genes) [88]. In addition, Kikuchi et al [25] report that the 28000 rice full-length cDNA sequences represent 20000 transcription units (TUs) and that 64% of these (12800) have a homolog in Arabidopsis. Assuming that 60-100% of these homologs are represented in the Arabidopsis cDNA collections, the estimated number of Arabidopsis homologs screened for uORF conservation is 7800-13000. Only 80% of Arabidopsis genes also have a homolog in rice (~21000) [25], therefore the uORF-Finder program has identified 37-62% of all conserved upstream ORFs (7800/21000 to 13000/21000) when comparing rice and Arabidopsis full-length cDNAs. Therefore, there should be 61-102 loci that contain conserved uORFs: 38 loci found by uORF-Finder, 6 additional loci found by aligning known uORF sequences with the Arabidopsis genome using BLAST, and 17-58 presently unidentified loci. Using both uORF-Finder and BLAST algorithms we estimate that between 43% and 72% of conserved peptide uORFs between monocots and dicots have been identified.

Number of Arabidopsis-Arabidopsis loci
A total of 60% of Arabidopsis genes are represented in the full-length cDNA collections used for this study. Therefore, the probability of selecting two loci that have conserved peptide uORFs from the pool of known sequences is 0.6*0.6 = 0.36. This translates to a total of 38 loci that have conserved uORFs using an Arabidopsis-Arabidopsis comparison (14 identified (36%), and 24 unidentified).

Total loci
We therefore predict that there are between 99 and 140 loci in the Arabidopsis genome that contain conserved peptide uORFs, 41-58% of which have been identified.
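The arithmetic behind this estimate can be retraced in a few lines of Python. The sketch uses the rounded figures quoted in the text and assumes that extrapolated locus counts are truncated to whole numbers, which reproduces the reported ranges; the variable and function names are illustrative.

```python
GENES_TOTAL = 26000            # assumed number of Arabidopsis genes
RICE_HOMOLOG_GENES = 21000     # Arabidopsis genes with a rice homolog (~80%)
SCREENED_LOW, SCREENED_HIGH = 7800, 13000   # homologs screened for uORF conservation
FOUND_FINDER = 38              # Arabidopsis-rice loci found by uORF-Finder
FOUND_BLAST = 6                # additional loci found by BLAST
FOUND_PARALOG = 14             # loci found by Arabidopsis-Arabidopsis comparison
CDNA_COVERAGE = 0.6            # fraction of genes in the cDNA collections

def extrapolate(found, coverage):
    """Scale an observed count by the fraction screened; truncate (assumption)."""
    return int(found / coverage)

# Arabidopsis-rice comparison: 37-62% of homologs were screened.
ar_low = extrapolate(FOUND_FINDER, SCREENED_HIGH / RICE_HOMOLOG_GENES)
ar_high = extrapolate(FOUND_FINDER, SCREENED_LOW / RICE_HOMOLOG_GENES)

# Arabidopsis-Arabidopsis comparison: both loci of a pair must be represented.
aa_total = extrapolate(FOUND_PARALOG, CDNA_COVERAGE ** 2)

total_low, total_high = ar_low + aa_total, ar_high + aa_total
identified = FOUND_FINDER + FOUND_BLAST + FOUND_PARALOG

print(f"Arabidopsis-rice loci: {ar_low}-{ar_high}")
print(f"Arabidopsis-Arabidopsis loci: {aa_total}")
print(f"Total loci: {total_low}-{total_high}")
print(f"Share of genes: {total_low / GENES_TOTAL:.2%}-{total_high / GENES_TOTAL:.2%}")
print(f"Fraction identified: {identified / total_high:.1%}-{identified / total_low:.1%}")
```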