Identification of functional long non-coding RNAs in C. elegans

Background Functional characterisation of the compact genome of the model organism Caenorhabditis elegans remains incomplete despite its sequencing 20 years ago. The last decade of research has seen a tremendous increase in the number of non-coding RNAs identified in various organisms. While we have mechanistic understandings of small non-coding RNA pathways, long non-coding RNAs represent a diverse class of active transcripts whose function remains less well characterised. Results By analysing hundreds of published transcriptome datasets, we annotated 3392 potential lncRNAs including 143 multi-exonic loci that showed increased nucleotide conservation and GC content relative to other non-coding regions. Using CRISPR/Cas9 genome editing, we generated deletion mutants for ten long non-coding RNA loci. Using automated microscopy for in-depth phenotyping, we show that six of the long non-coding RNA loci are required for normal development and fertility. Using RNA interference-mediated gene knock-down, we provide evidence that for two of the long non-coding RNA loci, the observed phenotypes are dependent on the corresponding RNA transcripts. Conclusions Our results highlight that a large section of the non-coding regions of the C. elegans genome remains unexplored. Based on our in vivo analysis of a selection of high-confidence lncRNA loci, we expect that a significant proportion of these high-confidence regions is likely to have a biological function at either the genomic or the transcript level. Electronic supplementary material The online version of this article (10.1186/s12915-019-0635-7) contains supplementary material, which is available to authorized users.


Background
Transcription is not limited to the proteincoding regions of eukaryotic genomes, but instead has been observed to be pervasive in all organisms that have been studied so far. As a consequence of transcriptional activity over noncoding sections of the genomes, tens of thousands of short, <200 nucleotide (nt), and long (>200 nt) noncoding RNAs have now been annotated [1,2] . While much is known about the biological role of most classes of small noncoding RNAs (e.g microRNA, Piwiassociated RNA, small nucleolar RNA, small interfering RNA) [3][4][5] , relatively little is known about long noncoding RNAs (lncRNAs). Whether most eukaryotic lncRNAs are functional has long been debated because of their low expression levels and rapid evolutionary turnover when compared to protein coding genes [6,7] . However, the molecular activities of more than a hundred of such loci have now been described [8][9][10][11][12] including many that appear to regulate the expression of proteincoding genes. Only a small proportion of these loci have been demonstrated to be fundamental to eukaryote biology from mutations that affect their expression or function leading to severe developmental defects or to lethal phenotypes (for example, [13,14] ). While transcription of some lncRNAs has been shown to originate at promoter or enhancer elements with potential DNAdependent function [15] , the activity of others depends on the RNA transcript, acting either in cis or trans ; e.g targeting protein complexes to chromatin or directly interacting with other RNAs, including mRNAs, lncRNAs, or microRNAs [16] . The proportions of lncRNAs belonging to each functional class remain unknown owing to painstaking experimental validations, including both knockout and knockdown assays being required.
C. elegans has been invaluable for the discovery of multiple noncoding RNA pathways and is an important model organism for genetic studies. Nevertheless, only one study has yet sought to identify and annotate lncRNAs in C. elegans , resulting in 801 annotated loci, of which only 170 had evidence of polyadenylation [17] . Furthermore, experimental characterisation of C. elegans lncRNAs has been limited [18][19][20] . Despite evidence confirming their expression [20] , previously identified lncRNAs of C. elegans still lack functional validation [17] .
Using publicly available RNASeq libraries representing diverse C. elegans developmental stages, we sought to annotate novel expressed long noncoding loci and to characterize informative features such as nucleotide composition, evolutionary conservation, transcript expression and functional enrichment. To assess the physiological impact of mutations within these lncRNAs and thus the biological importance of these loci, we used CRISPR/Cas9 to generate large genomic deletions for ten lncRNA loci. Six of these intergenic lncRNA loci yielded significant phenotypes upon deletion, and at least two of these have physiological functions that are RNAdependent. Our study and associated experimental validation demonstrate that physiological lncRNA function in nematodes can be RNA and/or transcriptiondependent. Furthermore, we extrapolate that a significant proportion of the newly identified multiexonic noncoding loci in the C. elegans genome might be functional at the genomic or the transcript level.

Long noncoding RNA annotation in C. elegans
We investigated 209 publicly available RNASeq datasets from diverse developmental stages (Additional File 1) to annotate de novo noncoding transcripts in C. elegans . After filtering for size, coding potential, overlap with existing protein coding genes and loci found in close proximity in the same orientation to annotated genes (see Methods), we identified 3,397 long (>200 nt) noncoding RNAs expressed across C. elegans development (Additional file 2). Only 197 of these loci were annotated previously [17] .
In total, 146 multiexonic and 3,251 monoexonic loci were identified (18 and 179 of which were previously known [17] ). CAGE data [21] was then used to accurately annotate transcriptional start sites, and ChIPSeq [22] and CLIPSeq [23] data were used to identify transcription factor and AGO binding sites within the loci (Additional file 3). As observed in all other model organisms, the identified lncRNA loci are smaller than annotated protein coding genes and are expressed at significantly lower levels (Additional file 4). LncRNA exons also tend to have a GC content that is lower than protein coding sequences but higher than intronic sequences (Fig. 1a, b) as observed previously for other eukaryotes [24] . Interspecies sequence conservation for multiexonic lncRNAs was lower than for proteincoding genes but higher than for monoexonic lncRNAs (Kruskal Wallis test, P =5.73×10 5 , Fig. 1c). Our newly annotated multiexonic lncRNAs show sequence features similar to the final set of 170 lncRNA reported by Nam and Bartel [17] , (Kruskal Wallis test, P =0.79 and P =0.15 for nucleotide conservation and composition respectively) showing the complementarity of these lncRNA annotations.
Distinct chromatin states inferred from histone modifications using ChromHMM [25] have been shown to associate with specific genomic elements (for example, transcriptional start sites and promoters, transcriptional elongation and gene bodies, enhancers, transposable elementderived sequences). Using previously published chromatin annotations in C. elegans [26,27] , we assessed the functional enrichment of our newly annotated lncRNA at each of these annotated genomic elements. Enhancers, identified either by Evans et al. [26] (1.7 fold enrichment P <0.0001) or Daugherty et al. [27] (2.0 fold enrichment, P <0.0001), significantly overlapped with these lncRNA loci but chromatin states associated with transcription elongation ("transcribed gene body") were depleted at all developmental stages (Figure 1d, Additional File 5). These results could be explained by the observed low expression level of the lncRNAs. Our results are in agreement with Evans et al. [26] , who showed that the chromatin states reflecting transcription elongation were associated with the most highly expressed genes in their study. Active enhancers were particularly enriched within single exon lncRNAs at all developmental stages (2.0, 2.6 and 2.4 fold enrichment respectively at the early embryonic, L3, and young adult stages, P<0.001 in all comparisons, Additional File 5). This result is consistent with enhancer RNAs rarely being spliced [28] . In contrast, multiexonic loci were only enriched for active enhancers during early embryonic stage  Fig. 2a). This restricted expression could reflect that many of the newly annotated loci are the result of transcriptional noise, and therefore likely nonfunctional [15] . However, many of the remaining lncRNAs, most specifically multiexonic lncRNAs, appear to be expressed in a tissue and stagespecific manner (Fig. 2b) . 26 lncRNAs (10 multiexonic) were expressed in more than 90% of the libraries (≥188 libraries). Highly reproducible loci ( ≥ 100 libraries) tended to have a significantly higher sequence conservation (Kruskal Wallis test, corrected P =0.0077) and higher GC content (Kruskal Wallis test, P =3.5×10 4 after Bonferroni correction) compared with loci with limited reproducibility (<10% libraries) (Fig. 2c, 2d). Highly reproducibly expressed loci also tended

Functional characterisation of lncRNA loci
Higher sequence conservation of the 146 multiexonic lncRNA loci, together with their higher exonic GC content and their splicing, could reflect organismal function. To test this hypothesis, we used CRISPR/Cas9 genome editing to generate targeted deletions in 10 of the multiexonic lncRNA loci that each showed high sequence conservation, high expression, evidence from multiple libraries and that are not overlapping with neighbouring coding regions. We were successful in generating large genomic deletions for ten of these lncRNA loci (Table 1) out of twenty that were initially targeted. This success rate was due mostly to the limited efficiency of plasmid based protocols that were available at the time, as compared to direct protein / RNA injection methods developed later [29] . Of these lncRNA locus deletions, nine removed at least one exonic region and one removed a region just 5′ of a lncRNA locus (Additional file 6).
All 10 lncRNA deletion mutants initially failed to display overt, gross phenotypes such as sterility, embryonic lethality or abnormal body development. To undertake a more extensive characterisation, we captured the development of the mutant animals alongside wild type control animals using an automated microscopy system. This system records the development of multiple animals simultaneously, and permits phenotypic analysis in an unbiased manner. Two phenotypes that can influence the lifehistory and fitness of populations [30] , brood size and growth rate, were selected for the automated analysis. Six of 10 lncRNA deletion mutants (linc217, linc239, linc249, linc260, linc305 and linc339) yielded significantly reduced brood size (Fig.3a) and 4 of 10 mutants (linc206, linc217, linc239 and linc249) displayed reduced growth rate (slower body size increase) over development (Fig.3b). Three mutants (linc217, linc239 and linc249) showed alterations of both phenotypes (Table 1  These phenotypes could be due to the removal of either the lncRNA transcript or of the genomic locus which, in some instances, harboured annotated transcription factor binding and enhancer sites (Additional file 3). To distinguish between these two possibilities, we generated dsRNA expression vectors for RNAi targeting of the lncRNA transcripts in wild type animals (Table 1).
Using four biological replicates per assay, we targeted 6 lncRNA transcripts using RNAi. We left out linc240, linc328, linc260 and linc305 because these were either lacking any phenotype or yielded only a weak phenotype when deleted.
Of the four of these six lncRNA loci whose deletion yielded a brood size phenotype, two (linc239 and linc339) yielded an equivalent phenotype when expression was reduced using RNAi ( Fig. 4a); for one of these lncRNA loci, linc239, equivalent reduced growth rate phenotypes were observed for both its knockout and knockdown (Table 1, Fig. 4b). Direct comparison of the phenotypes between lncRNA deletion mutants and RNAi knockdown of lncRNAs shows that RNAi knockdown phenotypes are slightly weaker or equivalent to deletion mutants ( Fig. 5a,b). Expression of the two lncRNAs, linc239 and linc339, was validated by RTPCR (Additional file 7) and we calculated their expression to be highest during larval development in comparison to embryogenesis (Fig. 5c). This indicates that, for these two loci, the phenotypes are caused by the disruption of their RNA transcriptdependent functions. RNAi targeting of another lncRNA, linc339, also showed a reduced growth rate, a phenotype that was not observed in the deletion mutant, which could therefore be a consequence of offtarget effects (Fig. 4b).Three additional lncRNA strains (linc206, linc217, linc249) yielded discordant phenotypes when disrupted or subjected to RNAi (Table 1; Fig. 4a). This would be consistent with functions of these loci being RNAindependent.

Discussion
The identification of functional noncoding elements, including transcribed noncoding sequences, in genomes has long relied on computational predictions based on sequence conservation [31] , or biochemical activity [32] . However regardless of the preferred approach to predict functional In our study, we first provide a novel annotation of intergenic lncRNAs in C. elegans. This work expands on the previous annotations delivered by Nam and Bartel [17] as a substantially more comprehensive RNASeq dataset was available at the time of our study (209 vs 35). We also took advantage of existing resources to further improve the annotations for these loci. These included not only their expression pattern and nucleotide conservation but also (i) the presence of potential functional elements within them (transcription factor binding and AGO binding sites), (ii) their correlation in expression with neighbouring protein coding genes, and (iii) the reported mutant phenotypes for these genes. The primary aim of these comprehensive annotations was to inform the selection of candidate lncRNA loci for followup experimental validation. Most importantly, we went beyond computational predictions of functionality as we assessed the invivo phenotypic effect of knockingout a selection of ten intergenic lncRNAs and implemented knockdown assays to validate the observed phenotypes and putative transcript mediated function of these loci.

lncRNAs of C. elegans
Our newly annotated loci bear all the hallmarks of lncRNAs in other organisms: they tend to be shorter, expressed at lower levels and have lower degree of conservation than protein coding sequences [6,7,11] . Furthermore, the GC composition of the multiexonic lncRNAs in C. elegans does mirror the patterns previously observed in other animals, with increased GC content within exons relative to introns [24] . These similarities with other animal lncRNA annotations implicate C. elegans as a model organism that is more broadly relevant for investigation of the molecular functions of lncRNAs and the processes through which those functions are conveyed. Most importantly the wealth of resources available for C. elegans as a model organism, offer the opportunity to assess the in vivo impact of mutations within these loci. The observation of enrichment for enhancer sequences within our lncRNA loci emphasises that the observed function of a locus could be conveyed by discrete functional DNA elements located within it, rather than by the RNA transcribed at this location. The former would imply that transcription at this location either reflects or maintains open chromatin states and that the resulting transcript would likely be biologically inconsequential, whereas the latter would imply RNA sequencedependent functionality of the resulting transcript [10,33] . LncRNAs with transcript mediated function have been shown to act both in cis (Xist, [34] ) and in trans (Paupar, [35] ), whereas those whose function is transcription regulation related are expected to act in cis (transcriptional interference, chromatin modification at enhancers). This duality, transcription versus transcriptmediated function, is a recurrent issue when studying lncRNAs, and only the careful experimental characterisation of each locus through knockdown, and rescue can begin to deduce the functional mechanism associated with a noncoding transcript [36] .

Phenotypic characterisation of lncRNAs
Historically, majority of C. elegans genes were identified through genetic screens which concurrently provide phenotypic and functional information. Mutations identified in noncoding regions of the genome as a result of genetic screens, nevertheless, have largely remained uncatalogued. With advances in genome editing methods, it is now possible to directly target noncoding regions for mutational analysis. The lncRNA annotations presented in this study, together with the detailed documentation of their expression and overlap with existing datasets, serve as a guide for the targeted analysis of these loci during animal development.
In C. elegans , many small noncoding RNA genes lack discernible phenotypes when deleted individually [37] . This is mostly due to redundancy between noncoding RNAs and the role such RNAs play as buffers in gene expression regulation rather than being the master regulators. The situation may be similar for the majority of lncRNAs, because their roles in the regulation of gene expression remains incompletely understood and the phenotypic characterisation of many vertebrate lncRNAs has been challenging and has provided sometimes contradictory results [38] .
By using automated microscopy, we sought to capture the phenotypes associated with lncRNAs in an unbiased manner. The observed reductions in brood size and growth rate of lncRNA loci deletion mutants greatly affect the fitness of these animals, despite their otherwise normal appearance. For two of the lncRNA loci, linc239 and linc339, the phenotypes can be recapitulated by RNAi knockdown. We thus consider these two lncRNAs as being representative of bona fide C. elegans lncRNAs. However, further experiments will be required to completely rule out that these lncRNAs are not translated into functional, short polypeptides [39] . The lack of phenotypes upon RNAi knockdown of the remaining loci could be attributed to the possibility that observed phenotypes in deletion mutants arising from the removal of DNAdependent functional elements. It is also possible that the transcripts of these loci are solely nuclear or expressed in neuronal tissues, and thus resistant to RNAi in C. elegans [40,41] .

Conclusions
In this study, we increased the current number of potential lncRNAs in C. elegans from 801 to 4001. Together with the previously identified highconfidence lncRNA loci, in total 298 loci yield evidence for possible biological functions because they display higher conservation, higher expression, higher GC content and splicing. Using genome editing and RNA interference methods, we tested the functional relevance of ten of these loci and demonstrated that six yield in vivo phenotypes when deleted. Furthermore, we showed that for at least two out of these six loci the function is likely conveyed by the RNA transcript. From our invivo assays, we estimate by extrapolation that 4060% of the multiexonic lncRNAs identified in this study might have biological roles. It will be essential to employ sensitive experimental approaches to decipher the fitness effect of such noncoding loci.

Intergenic lncRNA identification
A total of 209 publicly available libraries were retrieved from the SRA database ( https://www.ncbi.nlm.nih.gov/sra/ , Additional File 1). Reads were mapped onto the C. elegans (ENSEMBL release 73, WBcel235) reference genome using TOPHAT2 [42] . For each library, de novo transcripts were called using cufflinks2 [43] and the coding potential of all new intergenic transcripts was assessed using the Coding Potential Calculator (CPC, [44] , score <0). All of the loci for which every transcript was deemed noncoding, were retained for further analyses as potential intergenic lncRNAs. All lncRNAs across all libraries were merged into a single annotation file using cuffcompare. We retained for final analyses only the loci longer than 200 nt, not overlapping any annotated gene and found at least 50 nucleotides away from annotated genes if located on the same strand.

Intergenic lncRNA conservation
The nucleotide conservation of the candidate loci was assessed using the conservation tracks (PhyloP) from the UCSC database (http://hgdownload.soe.ucsc.edu/). The tracks represent the nucleotide conservation across 26 nematode species.

Chromatin modifications, transcription binding sites and enhancers associated with lncRNAs
In order to facilitate the prioritization of lncRNAs for mutagenesis we parsed publicly available data to further improve our annotations. We intersected our annotated loci with highly occupied target regions [45] , miRNA binding sites [23] , transcription factor binding sites identified by modENCODE [22] , and enhancers identified by Chen et al [21] . We also computed the distance to the closest protein coding gene as well as the correlation in expression between lncRNAs and their upstream and downstream flanking protein coding genes. Finally, we also reported the known phenotypes for the proteins flanking lncRNAs. Genomic locations of the respective annotations were transferred to the ce11 genome assembly using liftOver and files available on the UCSC database. Enrichment analyses were performed using the Genomic Association Test (GAT) software [46]

Identification of transcriptional start sites
We used the 5′ end tag sequencing data from Chen et al [21] to identify the putative transcriptional start sites of the intergenic lncRNAs. We applied the same approach the authors previously applied to their data. Clusters with at least 2 reads were kept and merged if on the same strand and within 25 nucleotides of each others.

CRISPR/Cas9 mediated deletion of lncRNA loci
lncRNA loci were deleted using either plasmid base injection [47,48] or direct protein / RNA injection methods [29] , as previously described. gRNA sequences and the primers used for screening of the F2 generation animals are given in Additional file 8. Isolated deletion mutants were backcrossed once to wild type animals. For lncRNA sequences and deletions see Additional file 9; for genotyping results of deletion mutants see Additional file 10.

Cloning of RNAi vectors
Genomic sequences corresponding to the lncRNA loci were amplified using the primers listed in Additional file 8, cloned using Gibson Assembly [49] into the L4440 vector and transformed into competent E. coli strain HT115 [40] .

Automated microscopy analysis
Growth Curves: Growth curves were estimated using longterm video imaging. In short, a custom camera system was used to record backlit images of C. elegans from the exutero egg stage to the egg laying adult stage (~65 hours). To accomplish this, an imaging system was built, which allowed 12 video cameras (Flea3 3.2MP monochrome, Point Grey) to record in parallel. These were used to record images of 40 C. elegans nematodes in 16mm circular arenas continuously at 1Hz for~3 days. These "miniwells" were placed in an enclosure where temperature was maintained at . The resulting movies were analyzed offline with a custom written MATLAB script 0C ±20 mK 2 (Mathworks). Tracking was based on the Hungarian Algorithm for linear assignments, [50][51][52] , and yielded spatial trajectories and time series of attributes such as the area of the 2D projection counted this way for 10 additional frames. The final count was returned as the mode of these counts. This system was tested on plates with fixed numbers of animals and was accurate to within 5%, comparable to human precision. Total brood size was reported then as the sum of the three days. For mutant strains, this experiment was done for 5 animals of each strain three times. For the RNAi experiment, this was done for 6 animals for each RNAi clone, also done three separate times. Data is censored for animals that crawled off of plates.

Declarations
The authors declare that they have no competing interests.