Introduction

Noncoding RNA (ncRNA) has received increasing attention because it has a diverse range of functions and participates in many biological pathways 1, 2. Recent transcriptome analysis of the human genome showed that the number of expressed transcripts is remarkably higher than expected from protein-coding sequences. A large number of transcripts are outside any known gene regions 3, 4, which implies that ncRNA genes are widely distributed in the genome. However, identification of ncRNA genes is still a difficult task, partially owing to the lack of common features 5. For example, ncRNAs do not have apparent open reading frames (ORFs) and codon information. This makes the existing gene-finding algorithms fail to identify ncRNA genes. In addition, the current biological method for discovering novel ncRNA genes is still inefficient and costly.

There are lots of expressed sequence tags (ESTs) located in un-annotated intergenic regions, and these are mostly expressed at low levels 3, 4. These low abundance ESTs have traditionally been considered unimportant 'noise' transcripts 6. Most gene-finding algorithms did not analyze these ESTs because of their unreliability and their lack of ORFs 7, 8. Recently, the analysis on a Drosophila cDNA collection revealed that some noncoding transcripts are polyadenylated and appear in the cDNA databases 9. In Arabidopsis, ESTs have also been used in retrieving ncRNAs from the genome 10. These pieces of evidence indicate that ncRNA transcripts might exist in the ESTs.

Here, we have developed a computational strategy for identifying novel ncRNA transcripts from intergenic ESTs in human. The reliability of the predicted ncRNA transcripts has been improved through comparative genomic analysis and promoter prediction. One hundred and eighteen contigs were predicted to be putative ncRNA transcripts. In addition, six of seven randomly selected candidates were verified using RT-PCR experiments in human 2BS cells.

Results

Intergenic regions matched with low abundance ESTs

Although most ESTs were aligned to protein-coding genes, some low abundance ESTs could be perfectly matched to intergenic regions in the human genome, which we named as intergenic ESTs. To screen ncRNA transcripts in the intergenic ESTs, a sliding window of 50-nuclotides (nt) was used to scan the human genome without any overlap. The number of ESTs matched with each 50-nt window was counted. As expected, few ESTs match with intergenic windows, whereas much more are aligned to the windows in exon regions (Figure 1). In total, 1 101 439 of the windows perfectly matched at least one EST in the intergenic regions. To improve reliability, only those intergenic windows that are supported by at least three ESTs were kept, which gave us 288 277 candidate windows.

Figure 1
figure 1

A sliding window of 50-nuclotides (nt) was used to scan the human genome without allowing any overlap between windows. The number of ESTs matching each given 50-nt window was counted separately. The X-axis represents the number of ESTs matching a 50 nt window and the Y-axis is the corresponding count to each X-axis value. The windows matching more than 100 ESTs are counted together. The results indicate that there are many ESTs that can be matched perfectly to intergenic regions.

Conservation of intergenic ESTs across different species

The PhastCons score from the UCSC annotation database was used to estimate the sequence conservation of all 288 277 intergenic windows. We defined the PhastCons score as the average score of 50 positions in a given window 11. If the PhastCons score was greater than 0.8, the corresponding intergenic window was used as a 'seed window' for further analysis. Figure 2 shows data from human chromosome 21 as an example. In total, there were 18 132 'seed windows' in the human genome that satisfied the above criteria (Figure 2D).

Figure 2
figure 2

Supporting ESTs and PhastCons scores for the 50-nt windows. The data from chromosome 21 are presented as an example in (A-C). The PhastCons score is the average score of 50 positions in the window. The conserved windows (PhastCons score ≥ 0.8, indicated by a black horizontal line) are much more abundant in exons (A) than in introns (B) and intergenic regions (C). Intergenic conserved windows with ≥ 3 supporting ESTs were kept as candidates. The numbers of 50-nt windows meeting the above criteria from the whole genome are summarized in (D).

Electronic elongation of putative ncRNA transcripts

To get putative ncRNA transcripts that were as long as possible, all 18 132 seed windows were used for electronic elongation. Each cluster of ESTs corresponding to a given seed window was assembled as a contig, which was the longest putative ncRNA transcript. This produced 3 457 potential ncRNA transcripts, which did not overlap with the RefGene and the KnownGene annotations.

Screening novel ncRNA transcripts using ECgene annotation

Although we did not use intronic ESTs in this work, it was possible that these predicted intergenic transcripts were from novel 5′-end or 3′-end exons of protein-coding genes. In order to exclude this possibility, ECgene annotations at a medium confidence level were used to filter any transcripts that overlapped with alternative spliced regions or alternative transcription start/alternative poly A sites. This gave us 318 potential ncRNA transcripts.

Putative ncRNA transcripts with probable promoters

We reasoned that most of the ncRNA transcripts that were predicted from ESTs were transcribed from Pol-II promoters. Therefore, we predicted the presence of transcription starting sites and core promoter regions within 2 000 nt upstream or downstream of potential ncRNA transcripts using Promoter 2.0. Only those with probable promoters remained.

Two additional criteria were also used: the first was that the length of the predicted ncRNA transcripts was less than 1 500 nt, and the second was that the length of any predicted ORF was not more than 300 nt. As a result, 118 novel ncRNA transcripts satisfying these stringent criteria were obtained (Supplementary information, Table S1). The detailed computational procedure is shown in Figure 3. We also calculated the distance between predicted ncRNAs and their neighboring annotated exons. The average distances were 66 109 bp according to the RefGene annotation and 49 585 bp according to the KnownGene annotation.

Figure 3
figure 3

Computational pipeline for identifying ncRNA genes from low abundance ESTs.

Validation of putative noncoding transcripts

Two strategies were used to validate the predicted ncRNA transcripts. First, we compared our results with recent transcriptome data from 10 human chromosomes. All available information on transcribed fragments (transfrags) obtained from tiling array experiments was downloaded from the UCSC database. Of the 118 putative ncRNAs, 36 transcripts were located in the 10 chromosomes selected for tiling array analysis. Also, 23 of the 36 predicted ncRNA transcripts (63.8%) were also detected by the tiling array (Supplementary information, Table S2) 12.

RT-PCR experiments were used to verify our results. Primers designed from neighboring exons of a human housekeeping gene were used as a control to ensure that there was no genomic DNA contamination. Seven putative transcripts, including two candidates filtered by ECgene annotation, were selected for verification, of which six were successfully detected by PCR (Table 1 and Figure 4). The PCR products of candidates E1 (E2), ncR118 and ncR95 were sequenced.

Table 1 Eight candidate ncRNA transcripts used for RT-PCR validation
Figure 4
figure 4

PCR results from the experimental validation of eight predicted ncRNA genes in the human 2BS cell. The ncRNA transcripts were amplified to their maximum length, as the PCR primers were designed at both ends of clustered transcripts. Three candidates (E1, E2 and E3) were filtered using the ECgene annotation. E1 and E2 are repeat sequences.

Discussion

There are huge numbers of ESTs available for mammals, insects, nematodes and plants. Although most ESTs are derived from protein-coding genes, there are still lots of ESTs that are mapped to intergenic regions without detailed annotation. Here, we performed a large-scale analysis of intergenic ESTs to screen for novel ncRNA transcripts. Because most intergenic ESTs are expressed at low levels, sequence conservation and promoter prediction were adopted to increase the reliability of our results. Some predicted ncRNA genes were confirmed by either tiling array transcriptome data or RT-PCR experiments. The transcripts were successfully detected from cDNA synthesized with oligo (dT) as the anchor primers, suggesting that most, if not all, ncRNA genes are transcribed by RNA polymerase II.

It is very likely that some real ncRNA transcripts were filtered out in our work, as only 118 ncRNA transcripts were discovered from more than five million ESTs. This is probably due to the strict criteria used, which improve the reliability but also reduce the sensitivity. For example, we required that the average PhastCons score should be larger than 0.8. This removed a large number of the intergenic ESTs and only kept 6.3% (18 132/288 277) seed windows. However, it has been reported that some ncRNAs are only structurally conserved and their primary sequences share low similarities 13, 14, 15, 16. Therefore, if we adjusted the criteria or performed a structural conservation analysis, more ncRNA transcripts would be discovered. Actually, there were 26 predicted ncRNAs overlapping with previous results based on structural analysis 16 (Supplementary information, Table S1). We suggest that the actual selection criteria should depend on the purpose of the work. In this paper, we have focused on introducing a computational strategy.

ESTs have been well studied for their role in discovering protein-coding genes, but they are seldom used for ncRNA transcript analysis. Our work proves that ESTs could be a 'hidden treasure' for studying ncRNA transcripts. Although tiling array analysis and other genome-scale transcriptome analysis has become useful in identifying noncoding genes, it still remains time-consuming and is costly because most ncRNAs genes are spatiotemporally expressed. Together with comparative genomics analysis, the method we propose is a helpful strategy to exploit large amounts of EST data for discovering ncRNA genes.

Materials and Methods

EST alignment

We used EST alignment resources between human ESTs and the human genome from the UCSC database, which contained 5 977 963 EST alignment entries (hg17, May 2004). To improve the reliability of our analysis, the following ESTs were removed: ESTs that were aligned to multiple regions in the human genome and ESTs that shared less than 90% sequence similarity with the human genome. In total, 3 946 573 EST entries remained.

Intergenic ESTs

The intergenic regions were defined according to the RefGene and the KnownGene tables in the UCSC annotation database 13. Overlapping exons and alternatively spliced variants were merged into a single consecutive region. Then, non-overlapping introns and intergenic boundaries were obtained accordingly. All intergenic regions were divided into 50-nt windows that were not overlapping. The number of ESTs that matched each 50-nt window was counted.

PhastCons score

To estimate the conservation of a given 50-nt window, we used the UCSC PhastCons conservation score in an 8-way vertebrate alignment of human, chimp, mouse, rat, dog, chicken, fugu and zebrafish 13. The average score of 50 positions in the window was defined as the PhastCons score. Human alignment data (hg17) were downloaded from the UCSC website, http://hgdownload.cse.ucsc.edu/goldenPath/hg17/ 13.

Screening candidates

Since we used low abundance ESTs, four strict criteria were chosen to improve the reliability: (1) that at least one 50-nt window could perfectly match with three ESTs or more; (2) that the 50-nt window had a PhastCons score of over 0.8 (for brevity, we defined the 50-nt window that satisfied the above two criteria as the seed window); (3) that a highly likely promoter could be predicted by the Promoter 2.0 software to lie within 2 000 nt upstream or downstream of the predicted ncRNA transcripts; and (4) that the length of any predicted ORF was not more than 300 nt. The ESTs that satisfied the above criteria were kept as candidates.

Filtering known ncRNA genes

Because the ECgene data set contained more gene annotation information than the RefGene and the KnownGene databases, we used the ECgene annotation database at a medium confidence level (version 1.2, hg17) to further filter any known transcripts, such as predicted ncRNAs and alternative spliced regions 14, 15. The candidate ESTs were removed if they had been already annotated in the ECgene data set.

RT-PCR experiments

Total RNA was isolated from human 2BS cells with TRIzol reagent (Life Technologies, Gaithersburg, MD) following the manufacturer's instructions. RNA was treated with Dnase I enzyme following the standard procedure. Primers that were designed from neighboring exons of a human housekeeping gene were used to detect genomic DNA contamination. If there was genomic DNA in the template, two bands were amplified. One was from cDNA that does not have introns, and the other was from DNA that has introns. First-strand cDNA was synthesized from total RNA using SuperScriptTMII RNase H-Reverse Transcriptase (Invitrogen) with oligo (dT) as anchor primers. PCR analysis was performed in accordance with standard procedures with 2 μM of each primer and 2 U ex-Taq DNA polymerase (Takara Corporation). The products were resolved by electrophoresis on 1% w/v agarose gel in TAE buffer (40 mmol/L Tris-acetate, 2 mmol/L Na2EDTA•2H2O) and stained with GlodenView. DNA bands were viewed using a UVP-GDS-8000 system UV transilluminator and the Lab-Works program (UVP, Inc). Some PCR results were sequenced for validation. The primer sequences are listed in Table 1.

(Supplementary information is linked to the online version of the paper on the Cell Research website.)