uORFlight: a vehicle towards uORF-mediated translational regulation mechanisms in eukaryotes

Upstream open reading frames (uORFs) are prevalent in eukaryotic mRNAs. They act as translational control elements for precisely tuning the expression of the downstream major open reading frame (mORF) with essential cellular functionalities. uORF variation has been clearly associated with several human diseases. In contrast, natural uORF variants in plants have never been identified or linked with any phenotypic changes. The paucity of such evidence encouraged us to generate this database, uORFlight (http://uorflight.whu.edu.cn). It facilitates the exploration of uORF variation among different splicing models of Arabidopsis and rice genes. Most importantly, users can evaluate uORF frequency among different accessions at the population scale and identify the causal single nucleotide polymorphism (SNP) or insertion/deletion (INDEL) that can be associated with phenotypic variation through database mining or simple experiments. Such information will help to generate hypotheses about uORF function in plant development or adaptation to changing environments on the basis of the cognate mORF function. This database also curates plant uORF-relevant literature into distinct groups. To serve a broader community, our database expands uORF annotation to more species of fungi (Botrytis cinerea), plants (Brassica napus, Glycine max, Gossypium raimondii, Medicago truncatula, Solanum lycopersicum, Solanum tuberosum, Triticum aestivum and Zea mays), metazoans (Caenorhabditis elegans and Drosophila melanogaster) and vertebrates (Homo sapiens, Mus musculus and Danio rerio). Therefore, uORFlight will light up the runway toward how uORF genetic variation determines phenotypic diversity and advance our understanding of translational control mechanisms.


Gene expression must be tightly regulated at the transcriptional, translational and post-translational levels. The imperfect correlation between protein abundance and mRNA levels suggests that translational efficiency, regulated by translational control, is one of the determinants of protein output from variable mRNA input. This layer of regulation is mediated by the cooperative action between different mRNA elements and trans-acting factors (1). Upstream open reading frames (uORFs) are among the mRNA elements that can confer precise control of protein translation.

A uORF initiation codon resides upstream of the coherent mORF and will be encountered first by the 43S scanning ribosome (including the 40S ribosomal subunit and the eIF2 ternary complex). Subsequently, the 60S subunit joins and reconstitutes the 80S ribosome for uORF translation elongation, after which the 40S and 60S subunits dissociate and the 40S subunit may remain associated with the mRNA. Therefore, usually uORF translation is prioritized over

However, once uORF-mediated precision control has been challenged by genetic variation or mis-regulation, it can cause human diseases (6,7). By 2009, 509 human genes had been identified with polymorphic uORFs, and some of them have been experimentally associated with different human diseases, including malignancies, metabolic or neurologic disorders, and inherited syndromes. This trend has become more striking recently as more genomic variation data have been released and analyzed (3,8). In contrast, natural variation of plant uORFs has not yet been investigated, even though there are abundant publicly accessible genetic and phenotypic variation data, especially for Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa L.). Since the release of the Arabidopsis reference genome of accession Col-0 in 2000, the rapid development of sequencing technology has bolstered genome-wide association studies (GWAS) by linkage disequilibrium of interesting phenotypic traits with the most probable genetic variation, particularly using the genome sequences of 1135 accessions from the 1001 Genomes Project (9-15). Genetic variation of rice has long been used for molecular breeding, and recent re-sequencing of a large set of rice accessions, especially from the 3000 Rice Genomes Project, generated a wealth of genetic variation for the discovery of useful alleles for agronomic trait improvement (16-24).

It is noteworthy that previous genotype-phenotype association studies mainly focused on protein-coding regions, while it is becoming more evident that cis-element variation, such as in promoter regions, contributes substantially to phenotypic variation (25). However, less attention has been paid to the variation of mRNA regulatory elements, such as uORFs. In this study, we used public resources to identify uORF variation for further experimental verification of phenotypic diversity mediated by translational control.

A uORF is defined by the presence of an initiation codon in an annotated mRNA 5' leader region and can be categorized into 'Types 1-3' based on the position of the uORF stop codon: Type1 in the 5' leader, Type2 in the mORF coding region and Type3 shared with the mORF (also known as an mORF N-extension). We term mRNAs without a uORF 'Free'. Reinitiation is impossible for translation of Type2 uORF-controlled mORFs (Figure 1b and c). Hereafter, ORF, for both uORF and mORF, means that AUG is used as the initiation codon, unless specifically stated. We use the term 5' leader sequence instead of 5' untranslated region (5' UTR), considering the peptide-coding potential of uORFs (26,27). The sequence from -3 to +4 relative to the AUG initiation codon (A as +1), corresponding to the Kozak consensus (A/GCCAUGG) positions, is termed the initiation codon context (ICC), with ICCu and ICCm for uORF and mORF, respectively (Figure 1d).

We chose the Arabidopsis Col-0 accession (Ensembl V39; TAIR11) and the rice Nipponbare cultivar (MSU V7) as reference genomes for dicot and monocot uORF analysis, respectively. Arabidopsis representative gene models (27445 in total;

INDELs with low quality indicated in the VCF files and used the alleles with frequency over 90% as suggested (28). We further removed genes with variants affecting the annotated initiation codon of their mORF.

We replaced all the splicing models with filtered variants and searched uORFs of

Level2 indicates minor differences that lead to nucleotide and/or amino acid substitution due to SNPs (Figure 1e). We grouped Arabidopsis accessions on the basis of the latitude at which they were collected (15-degree interval; Supplementary Table 1 in Download menu). The frequency of an individual uORF in Arabidopsis was calculated based on its occurrence in the total population and in different latitude ranges.
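The Type1-3 categories and the ICC window defined earlier in this section can be illustrated with a minimal Python sketch. This is not the pipeline used to build the database. The function name `find_uorfs`, the `UORF` record, the 0-based coordinate convention and the extra "unterminated" label (for a uORF with no in-frame stop codon) are all assumptions made here for illustration. A uORF start is counted only if its whole AUG lies inside the 5' leader, and any stop codon that ends after the mORF start but is not the mORF stop is treated as Type2.

```python
from typing import List, NamedTuple, Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}


class UORF(NamedTuple):
    start: int                # 0-based index of the A in the uORF ATG
    stop_end: Optional[int]   # index just past the in-frame stop codon, or None
    utype: str                # "Type1", "Type2", "Type3" or "unterminated"
    icc: str                  # 7-nt context (-3..+4); "" if it runs off the 5' end


def find_uorfs(mrna: str, morf_start: int, morf_end: int) -> List[UORF]:
    """Classify every ATG lying fully inside the 5' leader.

    morf_start: 0-based index of the A in the mORF ATG (hypothetical convention).
    morf_end:   index just past the mORF stop codon.
    """
    seq = mrna.upper().replace("U", "T")
    uorfs = []
    for i in range(max(0, morf_start - 2)):
        if seq[i:i + 3] != "ATG":
            continue
        # Walk codon by codon in the uORF frame until the first stop codon.
        stop_end = None
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in STOP_CODONS:
                stop_end = j + 3
                break
        if stop_end is None:
            utype = "unterminated"   # assumption: not a category in the text
        elif stop_end <= morf_start:
            utype = "Type1"          # stop codon inside the 5' leader
        elif stop_end == morf_end:
            utype = "Type3"          # shares the mORF stop (N-extension)
        else:
            utype = "Type2"          # stop inside (or, rarely, beyond) the mORF
        # -3..+4 with A as +1 spans seq[i-3 : i+4], seven nucleotides.
        icc = seq[i - 3:i + 4] if i >= 3 else ""
        uorfs.append(UORF(i, stop_end, utype, icc))
    return uorfs


# Toy transcript: the uORF ATG at index 3 is out of frame with the mORF
# ATG at index 7 and terminates inside the mORF coding region.
for u in find_uorfs("CCCATGCATGCCTGAGTAA", morf_start=7, morf_end=19):
    print(u.utype, u.icc)  # prints "Type2 CCCATGC"
```

A transcript for which `find_uorfs` returns an empty list corresponds to the 'Free' category.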

The frequency in rice was calculated as its occurrence in the total population and in nine subgroups (20). The associated SNP or INDEL identifiers are also recorded along with the uORF variants.
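The frequency calculation described above amounts to counting, within the whole population and within each subgroup, the fraction of accessions that carry a given uORF. The sketch below shows this for latitude groups. The function names, the accession identifiers and the choice to start the 15-degree bins at 0 degrees are assumptions, since the text specifies only the interval width. For rice, the same counting would apply with subgroup labels in place of latitude bins.

```python
import math
from collections import Counter
from typing import Dict, Set, Tuple, Union

Bin = Tuple[int, int]


def latitude_bin(latitude: float, width: int = 15) -> Bin:
    """Return the [low, high) latitude interval containing `latitude`."""
    low = math.floor(latitude / width) * width
    return (low, low + width)


def uorf_frequencies(
    latitudes: Dict[str, float],   # accession -> collection latitude
    carriers: Set[str],            # accessions in which the uORF is present
    width: int = 15,
) -> Dict[Union[str, Bin], float]:
    """Frequency of one uORF in the total population ("all") and per bin."""
    totals: Counter = Counter()
    hits: Counter = Counter()
    for accession, lat in latitudes.items():
        for group in ("all", latitude_bin(lat, width)):
            totals[group] += 1
            if accession in carriers:
                hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical accessions and latitudes, for illustration only.
latitudes = {"acc_A": 38.3, "acc_B": 52.0, "acc_C": 47.1, "acc_D": 59.9}
freq = uorf_frequencies(latitudes, carriers={"acc_A", "acc_B", "acc_C"})
print(freq["all"])  # prints 0.75
```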

A MySQL database schema was used for uORF information storage, and a user-friendly PHP web interface was designed for querying and downloading. GO analysis was done using Omicshare online tools (http://www.omicshare.com/) with the default settings. The uORF annotation information for the other species can be found on the website.

With the recent recognition of the significance of uORFs within distinct physiological contexts, the following functionalities will help the community quickly survey progress in this area and identify uORF variation to link with phenotypic diversity at the population level.

(20.65%) of uORFs in rice transcripts may arise from incomplete 5' leader annotation.

Their prevalence is mostly due to overrepresentation of Type1 uORFs, which account for 90.79% and 87.63%, in contrast to only 9.16% and 12.17% for Type2 uORFs, in Arabidopsis and rice, respectively. Type3 uORFs are the least common (19 uORFs in 17 Arabidopsis genes, and 74 uORFs in 41 rice genes), perhaps because they may give rise to N-extensions and are likely to alter protein activities or molecular localizations, as reported (29). Type2-containing genes tend to also carry Type1 uORFs, and the significance of their co-existence needs further investigation (Figure 2a). GO analysis shows clearly different enriched terms among uORF-Free, Type1-only and Type2 genes in both Arabidopsis and rice, suggesting that uORF type

consisting of just an AUG and a stop codon has been shown to be sufficient for translational inhibition of three boron (B)-related genes and also sufficient for translational responsiveness to low B stimulation (32). Such short uORFs (6.79%) are more commonly found among Arabidopsis Type1 uORFs and must require other synergistic cis-elements to allow translational responsiveness to specific stimuli.

Therefore, uORF length appears not to be a definitive parameter for prediction of uORF function. We then asked whether uORF surrounding sequences display informative patterns. We found that only 0.45% ICCu and 2.99% ICCm

In the current database, we only processed uORFs with AUG as the initiation codon.

to act in either trans or cis manners, and this information will need to be evaluated by the combination of ribosome footprinting and mass-spectrometry data, which will be integrated as it becomes available. In addition, uORFs are RNA cis-elements which require trans-acting factors to regulate translation (36). Meanwhile, co-regulatory cis-elements, such as the R-motif identified in our previous study (36), may account for uORF regulation specificity and diversity. Information about regulatory trans-acting factors and co-acting cis-element variation will be integrated into the database progressively. Furthermore, a uORF calculator will be developed to predict the regulatory power of natural or synthetic uORFs for tailored protein expression after machine learning of large experimental data is achieved.

We thank Sophia Zebell and Paul J. Zwack at Duke University for comments.