Identification and Functional Annotation of Long Intergenic Non-coding RNAs in the Brassicaceae

Long intergenic noncoding RNAs (lincRNAs) are a large yet enigmatic class of eukaryotic transcripts with critical biological functions. Despite the wealth of RNA-seq data available, lincRNA identification lags in the plant lineage. In addition, there is a need for a harmonized identification and annotation effort to enable cross-species functional and genomic comparisons. In this study we processed >24 Tbp of RNA-seq data from >16,000 experiments to identify ~130,000 lincRNAs in four Brassicaceae: Arabidopsis thaliana, Camelina sativa, Brassica rapa, and Eutrema salsugineum. We used Nanopore RNA-seq, transcriptome-wide structural information, peptide data, and epigenomic data to characterize these lincRNAs and identify functional motifs. We then used comparative genomic and transcriptomic approaches to highlight lincRNAs in our dataset with sequence or transcriptional evolutionary conservation, including lincRNAs transcribed adjacent to orthologous genes that display little sequence similarity and likely function as transcriptional regulators. Finally, we used guilt-by-association techniques to further classify these lincRNAs according to putative function. LincRNAs with Brassicaceae-conserved putative miRNA binding motifs, short ORFs, and whose expression is modulated by abiotic stress are a few of the annotations that will prioritize and guide future functional analyses.


Introduction:
As genomic and transcriptomic analyses have become more prevalent, it has become clear that genomes are not solely composed of protein-coding genes, housekeeping RNAs, and transposable elements. One particularly important set of findings came from the Human ENCODE (ENCODE Project Consortium 2012) project where it was discovered that over 60% of the human genome is transcribed at some point in development into long non-coding RNAs (lncRNAs). The term "lncRNA" is a catchall for a class of transcripts united by two key features: a length > 200 nt and poor protein coding potential. The term lncRNA itself can be subdivided into natural antisense transcripts (NAT-lncRNAs), intergenic (lincRNAs), sense overlapping transcripts (SOT-lncRNAs), and intronic (int-lncRNAs). Each of these classes of lncRNAs appeared in analyses of RNA-seq data because they look like mRNAs (e.g., they are capped, polyadenylated, and often multi-exonic) (Guttman et al., 2009). However, most lncRNAs were missed or ignored in earlier EST based screens because of their low or tissue-specific expression and lack of open reading frames. RNA sequencing data from over 37,000 experiments reflecting 60 different tissues under different experimental and developmental conditions led to the identification of > 100,000 high confidence lncRNAs in humans (Volders et al., 2013;Zhao et al., 2019;Zhao et al., 2020).
In contrast to proteins, which were the focus of study long before the genomes from which they are encoded were sequenced, an appreciation for the abundance and varied roles of lncRNAs has primarily emerged along with the accumulation of sequencing data. As a result, the catalog of functionally characterized lncRNAs is limited, both in number and in diversity of organisms where they have been annotated (Statello et al., 2020;Seifuddin et al., 2020;Chekanova JA, 2021).
Moreover, the extent to which functionally characterized lncRNAs are archetypal across plants, animals, and fungi is unknown. Thus, lncRNA identification and functional characterization lags far behind similar efforts in proteins, representing a fundamental gap in our understanding of how genomes operate.
Findings from across eukaryotes serve to illustrate the importance of lncRNAs to genome stability and regulation. Prominent mammalian examples include the telomerase RNA component (TERC), a scaffolding RNA that is crucial for chromosome maintenance (Feng et al. 1995); XIST, a guide RNA responsible for X chromosome inactivation (Brown et al. 1992); and HOTAIR, a developmentally-linked signaling RNA (31). In Arabidopsis, TERC has been characterized, with sequence and structural homologs present across the plant lineage, highlighting the potential for Palos K. et al., 2021 4 lncRNA conservation over long evolutionary timescales (Dew-Budd et al. 2020); (Fajkus et al. 2019); . Most other lncRNAs functionally characterized in plants, such as COOLAIR, ELENA1, SVALKA, MAS, APOLO, and HID1, change expression or function in response to environmental cues, and can thus be classified as environmental sensors (Csorba et al. 2014); (Seo et al. 2017); (Kindgren et al. 2019); Ariel et al. 2020;Y. Wang et al. 2014). These examples reflect the myriad of different mechanisms by which lncRNAs play important biological roles in plants, and also likely represent the tip of the iceberg of what remains to be discovered.
One critical factor behind the dearth of functionally described lncRNAs in plants relative to mammalian systems is the lack of annotated lncRNAs, and, where lncRNAs have been annotated, the disparity in criteria and transcriptional data used for identification. In Arabidopsis, the bulk of annotated lincRNAs are derived from two studies (Amor et al. 2009); (Liu et al. 2012), although other genome-wide examinations have been performed (Moghe et al. 2013); (Y. ). The former examined full length cDNA libraries for lack of coding potential, whereas the latter utilized TILING arrays to infer gene structure and transcriptional status. In both cases the maximum allowable ORF was 100 AA or less. Other lincRNA identification efforts outside of Arabidopsis (e.g., GREENC), used official genome annotations generated by MAKER (Cantarel et al. 2008) without direct transcriptional evidence and maximum allowable ORFs of 120 AA. Yet in other systems, lincRNA identification efforts are limited to a few tissues or developmental stages (Qi et al. 2013;Moghe et al. 2013;L. Li et al. 2014;Shuai et al. 2014). While functional lncRNAs have been identified from many of these efforts, the disparity in identification schemes, and discordant developmental stages and environmental conditions makes it difficult to make sequence or structural based comparisons within and across species as is typically done for proteincoding genes.
Here we present a comprehensive and unified annotation of lincRNAs, using criteria established in mammalian systems, across four model or agriculturally significant Brassicaceae: Arabidopsis thaliana, Camelina sativa, Brassica rapa, and Eutrema salsugineum. We reprocessed over 16,000 different publicly available RNA-seq experiments (> 24 Tbp of raw data), and created our own Oxford Nanopore (ONT) and Illumina RNA-seq data, to identify lncRNAs in each of these species. We focus primarily on the intergenic class of lncRNAs for evolutionary and technical reasons: the evolution of NAT and SOT-lncRNAs is obscured by the overlapping Palos K. et al., 2021 5 protein-coding gene, and the unstranded nature of much of the publicly available RNA-seq data makes confident strand assignation of single exon transcripts difficult. Using transcriptomic, proteomic, epigenetic, and genome-wide RNA-protein interaction data, we examined our lincRNA catalog for features that separate and define lincRNAs from other transcriptional units. We used evolutionary and comparative genomic approaches, leveraging the unique strength of plant polyploidy, to identify conserved lincRNAs among the four species and the rest of the Brassicaceae as well as identify conserved motifs for functional testing. Finally, we used all of these contextual clues, as well as guilt-by-association techniques, to assign putative function to lincRNAs within our catalog.

Identification of lincRNAs within the Brassicaceae:
To overcome difficulties in comprehensive lincRNA identification (low-expression, tissue/environmental specificity) in our focal species (Arabidopsis thaliana, Brassica rapa, Camelina sativa, and Eutrema salsugineum), we processed all RNA-seq data deposited to the Sequence Read Archive (SRA) at the NCBI [accessed December, 2018] for these species. We excluded SRAs with epigenetic mutants, degradome experiments (GMUCT and PARE), small RNA-sequencing, and experiments with low sequencing depth (fewer than 1 million quantified/mapped reads; Figure 1). In addition to publicly available short read RNA-seq, we also performed Oxford Nanopore Technology (ONT) PCR-free cDNA sequencing on three tissues (10day seedlings, 4-week mature rosettes, and open flowers) for the four focal Brassicaceae. We used previously developed workflows (Peri et al. 2019) utilizing the CyVerse computational infrastructure (Merchant et al. 2016) to map, in high throughput, ~24 terabases of RNA-sequencing data associated with 16,076 experiments (listed in Supplemental File 1) to their respective genomes. We then identified putative lncRNAs using the Evolinc computational pipeline (Nelson et al. 2017). After identification, we proceeded to filter the initial candidate lincRNAs based on a set of hierarchical filters similar to those used to identify the "gold standard" set of human lincRNAs by Cabili et al, 2011.
The first of these filters pertained to whether we observed these lincRNAs in our ONT sequencing data. Due to the potential for full-length coverage of the ONT cDNA sequencing (Seki et al. 2019), any lincRNAs identified by ONT cDNA sequencing were annotated as high confidence (HC) without further filtering. Using ONT cDNA sequencing we identified 200  6 unannotated (i.e., not present in the Araport 11 annotation) lincRNAs in Arabidopsis thaliana, 945 in Brassica rapa, 1,669 in Camelina sativa, and 563 in Eutrema salsugineum. Due to concerns with transcriptional artifacts common in short read RNA-seq at lowly expressed loci, we next retained lincRNAs as HC if they were multi-exonic. This filter selects for transcripts that are less likely to be artifacts of transcript assembly algorithms (Cabili et al. 2011). By this criterion, 678,12,422,6,200, and 1,812 multi-exonic lincRNAs were identified and annotated as HC in Arabidopsis, Brassica, Camelina, and Eutrema, respectively ( Figure 1C). Cognizant of previously identified and characterized mono-exonic lincRNAs (Lorenzi et al. 2021), (Sun and Ma 2019), (West et al. 2014)), we next filtered the remaining lincRNAs (mono-exonic) and labeled them as  Figure   1B). In contrast,8,867,11,977,7,432,4,893 low confidence lincRNAs were identified in the respective species.
Another source of false positives that we hoped to address comes from mis-annotating a transcript as a lncRNA when it in fact represents a misassembled or fragmented mRNA or is instead an extension of an annotated gene (e.g., a 5'UTR extension). To determine the frequency at which we were recovering false positive lincRNA assignments, we compared independently assembled transcriptomes from Illumina short read and ONT long read derived lncRNA datasets, searching for short read derived "lncRNAs" that were instead 3' or 5 extensions of an adjacent gene within our ONT sequencing data. Using this approach we identified 39 lincRNAs in Arabidopsis that shared at least 1 ONT sequencing read with a neighboring mRNA (must be on the same strand) out of 2,370 lincRNAs for which we obtained ONT coverage ≥ 1. Of the 39 lincRNAs with overlapping sequencing reads, only 2 appeared to be bonafide mRNA extensions (Supplemental Figure 1A). The 37 other lincRNAs appear to share sequencing reads due to misassemblies or genomic DNA contamination in the sequencing (Supplemental Figures 1B and  1C, asterisks), or are larger variants of Araport lncRNAs. In general we identified strong agreement between ONT and Illumina derived lincRNA transcript models (Supplemental Figure   1D), suggesting the depth of Illumina sequencing used here was more than sufficient to overcome misassembly common for lowly expressed transcripts. Given the low rate (1.64%) of false positives, we remain confident that the transcripts we have identified are indeed independently transcribed elements within the Arabidopsis genome.
We next assessed how many of the previously identified Arabidopsis lncRNAs were expressed in our assembled RNA-seq data. Given the comprehensive nature of our dataset, we presumed that a prior annotated lncRNA was misannotated if we did not observe expression above 1 TPM in at least 10 Arabidopsis RNA-seq datasets. There are ~ 3,000 annotated lncRNAs within the Araport11 genome annotation, a group that includes 2455 "lnc_RNAs", 286 "ncRNAs", and 726 "novel transcribed regions". To create a uniform dataset of lincRNAs, we filtered out transcripts that did not fit the most basic definitions of a lincRNA (over 200 nt and not overlapping a protein coding gene), and for which we did not observe expression. Of the 2455 "lnc_RNAs", 401 were removed for protein-coding gene overlap. A further 157 were relabeled as poor support due to lack of sufficient expression levels based on our expression filtering mentioned above (> 1 TPM in 10 experiments). However, we did observe low levels of expression (> 0.1 TPM) for some of these "poorly supported" genes in various tissue expression atlases, stress datasets, or our Nanopore sequencing (Table 1). In total, we confirmed 1,897 Araport lnc_RNAs to be HC-lincRNAs. For the 286 annotated "ncRNAs", 189 (66%) passed the length, intergenic, and expression criteria. Finally, we analyzed the novel transcribed regions and first assessed coding capacity as these features were not originally annotated as noncoding. We treated these transcripts to the same set of filters as our lincRNA dataset (ORF < 100 AA, longer than 200 nts, poor coding potential), which resulted in 571 NTRs annotated as lincRNAs and included in further analyses.
In total, we reclassified 2,566 Araport genes as lincRNAs (Supplemental File 2). Details on the lnc_RNAs, ncRNAs, and NTRs that were removed from the final dataset can be found in

Supplemental File 7.
The definition used by the community to identify lncRNAs is arbitrarily set and may include transcripts that encode for small proteins. Prior results have demonstrated previously annotated lncRNAs as bound to ribosomes, and in some cases have identified protein products for certain "lncRNAs" (Ji et al. 2015;Hsu et al. 2016;Wu et al. 2019). We used Ribo-seq (Ingolia et Palos K. et al., 20218 al. 2009Wu and Hsu 2021) and protein mass spectrometry (MS; (Domon and Aebersold 2006)) data from Arabidopsis seedlings (PRIDE: PXD026713) to identify translated short ORFs (sORFs) and protein products within our "lincRNAs''. Out of the 1,172 lncRNAs expressed > 0.1 TPM in these experiments, we uncovered evidence of translation for 158 lncRNAs (120 Ribo-seq/38 MS) ranging in size from 3-136 amino acids (Figure 2A). There is no correlation between lincRNA and sORF length (Supplemental Figure 2), but we did observe a tendency for previously identified lincRNAs (i.e., Araport-derived, n = 81) to contain longer sORFs than Evolinc-derived lincRNAs (n = 77; p-value 0.046; Figure 2A), likely due to the more restrictive criteria used to annotate the Evolinc lincRNAs. LincRNAs containing sORFs have been denoted as such in Supplemental File 2, but as they reflect previously unidentified genes that would otherwise have been called lincRNAs, were retained as a separate set of transcripts for downstream analyses.
We next used publicly available transcriptome-wide protein-interaction profile sequencing (PIP-seq, (Foley et al. 2017;Gosai et al. 2015)) data from roots (hair and nonhair) and seedlings (GEOs GSE58974 and GSE86459) to identify lincRNAs in our dataset that were protein bound and for which we have some measure of structure. Within the three tissues, we identified 397 structured and protein-bound lincRNAs. 135 (34%) lincRNAs were identified in all three datasets, whereas 195 were restricted to a single tissue ( Figure 2B). Of these tissue-restricted lincRNAs, 119 were found to be structured in root cells, with the vast majority (103; 26% of structured lincRNAs) only present in non-hair root cells (R-NH; Figure 2B). In contrast, most mRNAs (62%) were found to be structured in all three tissues, whereas only 6% were restricted to non-hair root cells ( Figure 2B). Thus, we have evidence for a subset of the Arabidopsis lincRNAs that are protein-bound and structured, with some evidence of tissue-specificity. These lincRNAs have been annotated in Supplemental File 2 and the MSAs are available in the CyVerse Data Store (DOI).
Some lincRNAs are known to interact with miRNAs, either in a competitive inhibitory fashion (i.e., miRNA sponge; (Zhang et al. 2019), or to directly regulate the lincRNA itself (e.g., TAS1A; (Howell et al. 2007;Chen, Li, and Wu 2007)). Using the miRNA binding site prediction tool psRNATarget (Dai, Zhuang, and Zhao 2018) we identified 226 Arabidopsis lincRNAs with at least one putative miRNA recognition site ( Figure 2C). Importantly, within this set of lincRNAs we identified previously identified miRNA-regulated lincRNAs such as TAS1A and TAS1B. We identified a further 668 miRNA-interacting lincRNAs in Camelina,2,741 in Brassica,and 1,168 in Eutrema ( Figure 2C). These putative miRNA-interacting lincRNAs, with their interaction sites, Palos K. et al., 2021 9 are annotated in Supplemental File 5. In sum, we used a wealth of public information to improve the genome annotations of four agricultural or model Brassicaceae.

Fundamental features of Brassicaceae lincRNAs:
We next examined basic characteristics of our lincRNA datasets with the goal of identifying features that might improve future lncRNA identification efforts. LincRNAs in all four species have significantly lower GC content relative to protein-coding genes (P value for all species' lincRNA-mRNA comparison < 2.2e-16, Wilcoxon Signed Rank test; Figure 3A).
Interestingly, when we looked at multi-exonic lincRNAs and mRNAs, we found that the average We next searched for differences in epigenetic regulation between lincRNAs and mRNAs.
Owing to the wealth of genome-wide epigenetic data in Arabidopsis and Brassica, we identified experiments in both species for making direct comparisons, including CpG DNA methylation and H3K27 trimethylation (H3K27me3; see methods; Supplemental File 4). We further divided our gene sets based on expression to better understand the interplay between expression and epigenetic regulation. In Arabidopsis, lincRNA loci are distinct from both transposable elements and proteincoding loci in that they have a consistent decrease in CpG methylation across the gene body ( Figure 3D; Supplemental Figure 3B). This decrease was largely consistent regardless of expression. In contrast, expressed protein-coding loci show the characteristic dip in CpG methylation at the transcription start site (TSS) and increase near the transcription end site (TES).
TEs show elevated CpG methylation across the gene body relative to their surrounding genomic regions. This trend is reversed in Brassica, where protein-coding loci displayed a peak in CpG methylation at the TSS and lncRNAs showed elevated CpG methylation across the gene body ( Figure 3E). Expressed protein-coding and lincRNA loci showed decreased levels of H3K27me3 both across the gene body and relative to non-expressed loci ( Figure 3D), a pattern that was recapitulated in Brassica ( Figure 3E). We also examined H3K9 acetylation, for which data were only available for Arabidopsis (Supplemental Figure 3B). Expressed Arabidopsis protein-coding loci displayed the characteristic peak in acetylation at the TSS and dip at the TES, whereas lincRNA loci displayed an increase in acetylation across the gene body that was positively associated with expression. In general, lincRNAs in Arabidopsis and Brassica are distinguished from protein-coding loci and TEs in that they display similar patterns across their gene body to the patterns associated with the TSS of protein-coding loci, a feature that becomes more pronounced with higher expression.
LncRNAs in mammalian systems are often tissue or cell-type specific, and often lowly expressed at the tissue level relative to mRNAs. This has also been observed to a certain extent in plant systems as well, albeit with far fewer tissue comparisons. Maximum lincRNA expression, in any tissue, was indeed ~10-fold lower compared to mRNAs in all species examined ( Figure 4A and Supplemental 4A). Tissue specificity (TAU; (Yanai et al. 2005)) was determined based on expressions data from tissue atlases in Arabidopsis ( (Klepikova et al. 2016) and Brassica (Tong et al. 2013;Bilichak et al. 2015), as well as from our ONT RNA-seq data. As expected, lincRNAs from all four species were, on average, significantly more tissue-specific than their respective mRNA cohorts ( Figure 4B and Supplemental Figure 4B). We also observed a negative correlation between lincRNA tissue specificity and expression, a feature that was significantly more pronounced than for mRNAs ( Figures 4C and 4D). This negative correlation was observed across multiple tissues (e.g., female reproductive, leaf, and male reproductive; Supplemental Figure 4C), although we did observe tissue-dependent differences, such as high expression associated with high specificity for both lincRNAs and mRNAs in pollen/anther RNA-seq data.
The sORF containing lincRNAs displayed expression and tissue specificity values similar to mRNAs ( Figures 4A and 4B), further supporting an mRNA assignment. Given this link between lower tissue specificity and coding potential, we more closely examined the Arabidopsis lincRNAs (n = 89) with TAU values lower than the median value for mRNAs (TAU < 0.502; Figure 4B, black box). Based on sequence similarity, these broadly expressed lincRNAs do not appear to be recently pseudogenized protein-coding genes, but, for a subset (n = 61), expression is significantly correlated with a neighboring gene less than 500 bp away (Supplemental Figure 4D). Thus, high tissue specificity and low expression can be considered a defining feature of Brassicaceae lincRNAs and can potentially help to distinguish unannotated sORF containing transcripts.
In mammalian systems, a large number of lincRNAs are expressed, or show elevated expression, in male reproductive tissues ( (Hong et al. 2018)). This phenomenon is attributed to relaxed epigenetic control within these tissues. We sought to determine if this was also a feature of plant lincRNAs by examining lincRNA expression within the Arabidopsis and Brassica tissues atlases. Approximately 45 and 35% of lincRNAs in Arabidopsis and Brassica, respectively, were most highly expressed in reproductive tissues, with pollen being the predominant source of maximum expression levels ( Figure 4E and 4F). A similar percent of mRNAs showed peak expression in reproductive tissues in the two species, suggesting a general transcriptome-wide, instead of lincRNA-specific, phenomenon. However, lincRNAs restricted to pollen tissue were expressed significantly higher than lincRNAs restricted to other tissues (e.g., female reproductive versus leaf tissue; Supplemental Figure 4C, note scales). To aid in exploration of lincRNA and mRNA expression between tissues and experiments, these data have been uploaded to the appropriate BAR eFP Browser (Provart and Zhu 2003), and are explorable through an interactive Clustergrammer (Fernandez et al. 2017) Jupyter notebook binder found at https://github.com/Evolinc/Brassicaceae_lincRNAs (Supplemental Figure 5).
Interestingly, 48% and 60.8% of the complete (HC + LC) Arabidopsis and Brassica lincRNA datasets, respectively, were not expressed above 0.1 TPM in their respective tissue atlas suggesting these lincRNAs are not expressed under "normal" conditions during development.
Considering that expression was a requirement for identification, we sought to determine where these "context-specific" lincRNAs (CS-lincRNAs) were expressed. We screened through all of the  Cheng et al., 2016), indicating a high degree of transcriptional variation between genetic backgrounds. We also observed a subset of Arabidopsis lincRNAs (~350) that were only expressed in specific accessions or in crosses between accessions. Finally, 7,407 (10.4%) Brassica CS-lincRNAs were expressed under stress conditions. To allow researchers to sort lincRNAs based on their own priorities, expression metadata have been assigned to each CS-lincRNA in Supplemental File 2. These data highlight both the extreme tissue specificity possible for lincRNAs, as well as the potential for lincRNAs to be expressed during, and perhaps play a role in, growth and development of recent hybrids.

Evolutionary features of Brassicaceae lincRNAs
As conservation is typically seen as a proxy for functional significance in protein-coding genes, we next sought to determine the degree to which lincRNAs from each of the four species were conserved across the Brassicaceae. Using each of the respective sets of lincRNAs as query, we first searched for sequence homologs within the genomes of nine Brassicaceae as well as Tarenaya hassleriana, a representative of the sister family Cleomaceae (Cheng et al. 2013). A comparative analysis of the complete Arabidopsis lincRNA dataset with Evolinc-II (Nelson et al. 2017) revealed that 32.9% (6,781) are species-specific (i.e., no sequence homologs are identified in any other species within the family; Supplemental File 3). Of the remaining ~14,000 lincRNAs, 6,045 (29% of total) are restricted to the genus Arabidopsis (node 1; Figure 5A). Much of these species or genera-specific conservation is driven by the low-confidence lincRNAs. The low-confidence lincRNAs were significantly more likely to be poorly conserved (nodes 0-1; Supplemental Figure   6A) than the high-confidence lincRNAs. Sequence homologs are present for ~35% (4,336) of the Arabidopsis lincRNAs at node 4, which represents the coalescence point between Brassicaceae lineages I and II ( Figure 5A), suggesting these lincRNAs originated ≥ 43 MYA. The majority of these sequence homologs corresponded to either Evolinc-identified lincRNAs or unannotated intergenic sequence in each of the other species (Supplemental Figure 6E), suggesting that these lincRNAs have been evolving as lincRNAs, and not as pseudogenized loci, for the last 43 MY.
The percent of species-specific lincRNAs for the other three species ranges from 49% (C.   Figure 5A). Interestingly, we identified "transcriptional syntelogs" adjacent to multiple CBF1 loci in Brassica, in a similar orientation and distance to CBF1 as the Arabidopsis lncRNA SVALKA (Supplemental Figure 7E). In total, four putative SVALKA transcriptional syntelogs were identified next to CBF1 paralogs in Brassica. A putative transcriptional syntelog was also identified for DRIR1 in Brassica (Supplemental Figure 7B). Thus, a significant proportion of "species-specific" lincRNAs in Brassicaceae may in fact be transcribed from syntenic loci with diverged sequence and therefore harbor conserved cis-regulatory functions.
Given the apparent expansion of the HID1 and APOLO gene families, we asked how frequently lincRNA gene families expanded and contracted, particularly in light of whole genome duplication events. We examined lincRNAs for which we were able to identify a sequence homolog in at least one other organism, and then asked if the number of lincRNAs in each gene family matched expectations based on known genomic events (see Materials and Methods). For Arabidopsis and Eutrema, which have not undergone recent WGDs, lincRNA gene families are predominantly stable (no change for 89% and 81% of lincRNA families, respectively; Figure 5B).
In contrast, Camelina and Brassica lincRNA gene families have experienced a significant contraction, with 67% and 98% experiencing a contraction, typically reducing the lincRNA copy number from the expected three copies back to a single copy. This is a higher rate of contraction than has been observed for protein-coding genes (De Smet et al. 2013). Indeed, 71% of Camelina lincRNAs, and 85% of Brassica lincRNAs are single copy, suggesting weak selective pressures to retain these genes in multicopy form. In addition, in Brassica, where the least and most dominant subgenomes have been assigned (Cheng et al., 2013;Tang et al., 2012), most single copy lincRNAs, and most lincRNAs in general, fall within the least fractionated subgenome (LF; n = 26,284), vs the medium fractionated (MF1; n = 21,712) and the most fractionated (MF2; n = 15973; Supplemental Figure 6F). For each of these sets of lincRNAs, ~50% are predominantly expressed in datasets examining intra-specific hybrids or RILs (Supplemental Figure 6F). These data suggest that lincRNAs, like mRNAs, are preferentially retained on dominant subgenomes following whole genome duplication events, but that lincRNA hybrid-specific expression is not linked to subgenome of origin.
Camelina has a substantial number of multi-copy lincRNA gene families, and thus offers an opportunity to monitor the impact that whole genome duplication events have on lincRNA expression. Camelina is an allohexaploid, consisting of three subgenomes similar to its two progenitor species (C. hispida and a C. neglecta-like autotetraploid), referred to here as the C.
hispida, C. neglecta, and C. neglecta (like) subgenomes. In C. sativa, C. hispida mRNA homeologs are typically more highly expressed relative to those from the other two subgenomes (Chaudhary et al., 2020). To explore how WGD has impacted lincRNA expression, we performed Illumina short read RNA-sequencing in early embryos of Camelina (n = 5). These data were mapped to the reference genome with an updated gene set including our Evolinc-identified lincRNAs. For lincRNAs families with homologs present in the three subgenomes (see Materials and Methods), and with at least one member expressed above 1 TPM in Camelina embryos, we see a significant bias against expression of the homologs from the C. neglecta (like) subgenome, with similar expression in the other two subgenomes (Supplemental Figure 8A). This is in contrast to lincRNAs that are only found in one of the three subgenomes, and thus not evolutionarily related (Supplemental Figure 8B). LincRNAs arising from the C. neglecta (like) subgenome showed a slight but significant elevated average expression relative to the other two subgenomes (Supplemental Figure 8B). Orthologous lincRNAs also behave differently than orthologous protein-coding genes, with C. neglecta showing the lowest average expression relative to the other two subgenomes (Supplemental Figure 8C). These data suggest that there is an evolutionary context-dependent contrast in expression between lincRNAs found in the same subgenome in Camelina.
We observed a number of distinct features within the Arabidopsis lincRNA dataset, including structured regions, sORFs, and miRNA interaction motifs that may act as functional motifs (Lucero et al., 2020). If these elements are important for lincRNA function, then we would expect them to be conserved. Structural elements, inferred from PIP-seq data, strongly correlated with conservation ( Figure 5C). Of the 415 lincRNAs for which structured elements were identified, 324 were conserved outside of Arabidopsis. For 70% of these sequence conserved lincRNAs, the structured region was conserved to the same node as the lincRNA itself, suggesting the structural element itself is driving conservation of the lincRNA. An example of this is shown in Figure 5D, where the two structural elements of lincRNA "Evolinc_tID.00064432" overlap with deeply conserved (i.e., Node 5) portions of the lincRNA.
We next addressed the degree to which the 158 sORF-containing lncRNAs are conserved, as conservation of the sORF would lend support to the idea that these "lincRNAs" are actually protein-coding transcripts. Of the 127 sORF lincRNAs conserved outside of Arabidopsis, there was no significant variation in the overall rate of conservation relative to non-sORF lncRNAs, indicating that sORF-lincRNAs are not preferentially retained. Of the 158 sORF lincRNAs tested, 53 sORFs were conserved in at least one other species, and 39 were conserved to the same node as the lincRNA from which they were derived (see Materials and Methods; Figure 5E; Supplemental File 4). There was no clear bias towards the length of sORF or the encoding transcript (Supplemental Figure 2). Some of the conserved sORFs were quite short, such as the sORF within AT1G06113, which encodes for a nine amino acid peptide and lies within a region of the sORF-lincRNA that shares almost 100% identity across the 11 species present in the MSA ( Figure 5F). Although most sORF-lincRNAs (26/36) were previously annotated lincRNAs (i.e., Araport lincRNAs), a subset were identified in this study, suggesting that current filtering schemes are not entirely sufficient for removing short protein-coding transcripts from our dataset.
Finally, we determined the degree to which the predicted miRNA interaction sites within our Arabidopsis lincRNA dataset were conserved. Of the 226 lincRNAs with predicted miRNA interaction sites, 68 were species-specific ( Figure 5G). A further 83 were sequence-conserved in at least one other Brassicaceae, but the conserved region did not overlap with the putative miRNA interaction site. The remaining 75 lincRNAs contained sequence conserved miRNA interaction motifs, with an example for this shown for AT1G50055 (TAS1B) in Figure 5H. Multiple sequence alignments supporting our conservation assignments for structure, sORFs, and miRNA interaction sites can be found in the CyVerse Data Store (DOI). LincRNAs with conserved domains are annotated in Supplemental File 4. In sum, our evolutionary approach has uncovered conserved lincRNA functional elements that are strong candidates for functional analysis, as well as shed additional light on how plant lincRNAs evolve in the face of WGD.

Assigning putative function to Brassicaceae lncRNAs
Basic characterization of lincRNA expression, along with an evolutionary analysis, can provide clues as to which lincRNAs in our datasets are potentially functional, but these data alone are not sufficient to begin making functional hypotheses or attributing putative functions. To better clarify when and where the lincRNAs in our catalogs are functioning, we took three approaches.
The first was to determine which lincRNAs are stress responsive based on pairwise comparisons of publicly available RNA-seq data (stress vs. control). The second was to use weighted gene coexpression networks (WGCNA) of larger, more complex experiments, to identify modules of similarly expressed protein-coding and lncRNA genes (i.e., guilt-by-association) and infer in which molecular pathway a lincRNA might be acting. Third, as many lincRNAs regulate the expression of neighboring genes (Khyzha et al. 2019;Gil and Ulitsky 2020), we examined correlation of expression of lincRNA-adjacent mRNA gene pairs across tissue and stress expression atlases to identify candidate gene pairs in which the lincRNA might be regulatory.
We first started by annotating which lincRNAs in each species are differentially expressed in response to stress. For Arabidopsis and Brassica, we chose publicly available datasets with multiple independently generated stress experiments (Supplemental File 4). In both species, most of the stress-responsive lincRNAs were specific to a particular stress (Figures 6A, 6B, and Supplemental Figure 9A), with the highest proportion associated with temperature stress (cold, heat, or cold + heat). We observed a similar pattern for protein-coding genes in both species (Supplemental Figures 9B and 9C), therefore we sought to determine which of the Arabidopsis stress-responsive lincRNAs were also ABA responsive. We screened through Arabidopsis RNAseq data associated with seedlings and roots treated with exogenous ABA (5-100uM) and identified 672 ABA-responsive lincRNAs, 105 of which overlap with our stress-responsive lincRNAs, suggesting these lincRNAs may be stress-responsive in an ABA-dependent manner ( Figure 6A, inset). As lincRNAs were predominantly responsive to temperature stress (heat and cold), we next asked how many lincRNAs showed an anti-correlated response to temperature stress (i.e., up in heat and down in cold). In both species, heat/cold responsive lincRNAs were predominantly upregulated by heat and repressed by cold (Figures 6C and 6D). This pattern was specific for lincRNAs, as most mRNAs were either up in both or down in both conditions (Supplemental Figure 9D). Taken together, we observe that a substantial fraction of lincRNAs are differentially regulated during temperature stress in both Arabidopsis and Brassica. Stress and ABA-responsive lincRNAs for each species have been denoted in Supplemental File 2.
Additionally, all differential expression results from the 4 focal Brassicaceae can be found in

Supplemental File 8.
WGCNAs help to identify clusters of genes that are coordinated in their expression and thus potentially regulated by, or are regulating, similar pathways. This allows us to assign putative functions to lincRNAs based on significant co-expression with functionally known mRNAs, a process referred to as guilt-by-association (Tian et al. 2008). To remove noise from normalizing across many disparate experiments, we grouped experiments by tissue or, where available, by project (as in the case of the tissue atlases; see Materials and Methods). In total, we identified 987 lincRNAs in Arabidopsis and 3,473 lincRNAs in Brassica whose expression profiles were sufficient to classify them into at least one co-expression module. For example, when we examine the Arabidopsis Klepikova tissue atlas, we identified a module of 233 mRNAs and 11 lincRNAs (8 newly annotated) whose expression peaked in flowers and male reproductive tissues (i.e., anther and pollen; Figure 6E). Within this module, gene ontology terms associated with fertilization were enriched, suggesting that lincRNAs within this module are also participating in aspects of fertilization (Module associated GO terms found in Supplemental Figure 10A). We also observed lincRNAs falling into co-expression modules within the Klepikova tissue atlas stress experiments.
One particular module contains 182 transcripts (176 mRNAs, 6 lincRNAs) whose expression peaks rapidly after wounding ( Figure 6F). As expected, the GO terms we see with these member mRNAs are highly enriched for response to wounding and jasmonic acid regulation (Supplemental Figure 10B), a hallmark hormone response to herbivory and biotic stresses (J. Wang et al. 2020

Expression for a subset of LincRNAs is significantly correlated with adjacent mRNAs:
LincRNAs are known to regulate the expression of other genes, either in cis or in trans, through a variety of mechanisms ( (Kindgren et al. 2019;Gil and Ulitsky 2020)). One signature of cis-regulatory lincRNAs is correlation in expression relative to neighboring genes across a diverse transcriptomic dataset. To identify putative cis-regulatory lincRNAs, we searched for correlation between all Arabidopsis and Brassica lincRNAs and their immediate neighboring mRNAs that were expressed above 0.1 TPM (i.e., both lincRNA and mRNA > 0.1 TPM) in either their respective tissue atlases or heat stress experiments. In the Arabidopsis tissue atlas we identified 252 lincRNA-mRNA pairs in which both genes in the pair were expressed and for which we could calculate expression correlation (out of 7,875 lincRNAs examined). This correlation was significantly more positive than mRNA-mRNA pairs or random pairs of genes ( Figure 6G; P value=1.62e-14; Wilcoxon rank sum test with Bonferroni multiple testing correction). When examining all genes that fall within 10 Kb of an expressed lincRNA, we observe even stronger positive correlation, in contrast to mRNA-mRNA pairs within the same region, which show very little correlation across all distances measured (up to 10 Kb; Supplemental Figure 11A). We observed even more lincRNA-mRNA pairs with correlated expression during heat stress in Arabidopsis (n = 2,544), again with a significant positive correlation relative to mRNA-mRNA pairs ( Figure 6I). We also observed positive correlation for Brassica lincRNA-mRNA pairs in the Brassica tissue atlas (3,757 out of 23,756 expressed lincRNAs; Figure 6H and Supplemental Figure 11B) and heat experiments (n = 6,514), although this correlation was less pronounced than in Arabidopsis. In sum, we identify a subset of lincRNAs whose expression appears to be positively correlated with neighboring genes up to at least 10 Kb away, suggesting that these lincRNAs might be cis-regulatory RNAs. LincRNA-mRNA pairs with a strong correlation (r > 0.5 or r < -0.5), as well as all correlated neighboring pairs are listed in Supplemental File 2.

Synthesizing our functional assignment approach:
We searched for experiments that took a holistic approach towards the analysis of stress responsiveness where we could assess the active regulation and response of lincRNAs (i.e., not just RNA-seq, but ChIP-seq or other methods to study the regulation of gene expression). Lee and Serres (2019) performed such an integrative approach to understand hypoxia responses in Arabidopsis seedlings. We set out to re-analyze these data in the context of both mRNAs and lincRNAs. By reanalyzing expression data, we observed 153 Arabidopsis lincRNAs that are differentially expressed in response to hypoxic stress. We also observe 62 lincRNAs that fall into a co-expression module with significant enrichment of GO terms associated with hypoxia (Supplemental Figure 13A-H). In Arabidopsis, changes in gene expression in response to hypoxia is regulated predominantly by the transcription factor HRE2 (Hypoxia Responsive Ethylene Responsive Factor 2; AT2G47520). We reexamined Arabidopsis HRE2 ChIP-seq data in the context of the hypoxia stress-associated lincRNAs and found evidence for HRE2 binding to the promoter regions of 20 differentially expressed lincRNAs. We then examined correlation in expression between the HRE2-bound lincRNAs and their adjacent mRNAs, identifying four lincRNAs with positive correlation, one with negative, and two with mixed correlation with adjacent genes (Figure 6K). Thus, these seven lincRNAs appear to be specifically regulated by Palos K. et al., 2021 20 HRE2 in response to hypoxic stress and may act as cis-regulatory elements (annotated in Supplemental File 2).
In sum, we have used a wealth of public data, supplemented with additional short and longread RNA-seq, to identify and provide putative functional annotations for lincRNAs across four Brassicaceae. We combined our transcriptomics data with comparative genomic and evolutionary analyses to determine conservation of not just the lincRNAs themselves, but also inferred functional elements within the RNAs, such as sORFs, structured regions, and miRNA interaction motifs. Using these approaches, we have identified >100,000 Brassicaceae lincRNAs with multiple lines of functional or contextual evidence that will facilitate downstream functional analyses.

Discussion: A comprehensive and unified lincRNA annotation effort for the mustard lineage:
Here we generated an expansive catalog of high confidence lincRNAs for four agricultural and model Brassicaceae species by processing > 20,000 publicly available RNA-seq datasets for those species. We supplemented these publicly available data with our own ONT long read sequencing data, and further annotated the identified lincRNAs with epigenetic, genomic, structural, translational, and evolutionary information. These efforts build on previous efforts to catalog novel transcribed elements within plant genomes (Liu et al., 2012;Moghe et al., 2013), and serve as the most exhaustive lincRNA identification and annotation effort to date in any plant species.
Due to the scale of our efforts and the wealth of data available for these four species, we were able to uncover defining features for Brassicaceae lincRNAs, features that may guide future discovery and annotation efforts in other plant lineages. LincRNAs tend to be mono-exonic, but when multi-exonic, harbor longer exons relative to those seen in spliced transcripts. LincRNAs appear to be epigenetically regulated in a distinct manner from both protein-coding genes and transposable elements. And, as expected based on prior observations in plants and mammals, lincRNAs in all four species were, on average, expressed at low levels and displayed significantly higher tissue specificity relative to protein-coding genes in tissue atlases and our ONT data. The exception to this observation were the sORF containing lincRNAs, which behave more similar to protein-coding genes in terms of both higher expression levels and tissue specificity. Interestingly, many of the lincRNAs we identified displayed high expression in, or were restricted to, very specific cell types (e.g., meristematic tissue) or experimental conditions (e.g., environmental stress) suggesting that 1) lincRNA expression is highly context and cell-type specific, and 2) sampling bulk tissues may not accurately reflect a lincRNA's contribution to the transcriptome.
The lincRNAs restricted to inter-accession crosses as in B. rapa may be the result of improper transcriptional control given their relatively even distribution across the genome or, albeit less likely, may reflect transcripts that help mediate compatibility of two subtly different genomes.

Using comparative genomics to provide functional insights:
Given that we identified thousands of lincRNAs in each of our four focal species, functional analyses will need to be prioritized. In order to facilitate that prioritization, we used a comparative genomic approach to assess the degree to which each identified lincRNA is conserved, and if there are any particular motifs of interest within those conserved lincRNAs. As expected based on prior observations in plants and mammals, we observed low levels of sequence conservation for lincRNAs identified in each of the four species relative to protein-coding genes. However, when sequence homologs were detected between two species (e.g., Arabidopsis to Brassica), those sequence homologs were predominantly annotated as lincRNAs and not protein-coding genes.
Inspired by a smaller comparison between Arabidopsis and Aethionema (Mohammadin et al.,

2015)
, we also searched for and observed a cohort of lincRNAs that are transcriptional syntelogs in that they are transcribed from similar genomic positions in multiple species but share little sequence conservation. LincRNAs that regulate gene expression in cis are an interesting class of transcripts from an evolutionary perspective in that positional and transcriptional conservation may be more critical than sequence conservation. Although additional study is needed, we posit that these lincRNAs may be functionally conserved in regulating expression of the orthologous genes to which they are adjacent in each species. An exciting set of candidates for further study are the putative SVALKA transcriptional syntelogs we identified in Brassica. In Arabidopsis, SVALKA regulates an adjacent, non-overlapping, protein-coding gene through transcriptional interference.
This mode of function in particular might depend more on conservation of transcription, and from where transcription arises, than it does on sequence similarity.
Identifying lincRNAs in species with recent WGD events (e.g., Camelina and Brassica) allowed us to more closely examine how lincRNAs evolve following these genomic events.
LincRNAs are not typically retained as multicopy loci following WGD events. In Brassica, lincRNAs are predominantly retained as single copy from the least fractionated genome. The fractionation -and retention, of a certain set of lincRNAs may suggest functional interactions (e.g., genetic or molecular) are preferentially retained following WGD events -similar to that observed for protein-coding genes (Emery et al., 2018;Schnable et al., 2012). When lincRNAs are retained as multicopy, their expression appears to be more sensitive to the influence of subgenome dominance than protein-coding genes -perhaps explaining why they are fractionated from the genome. However, the retention, and expression, of paralogous lncRNAs such as HID1 may suggest that lncRNAs can be functionally retained post-duplication in a similar manner as proteincoding genes. Further studies are needed to determine if these paralogous lincRNAs (e.g. HID1) have sub or neo-functionalized as is often the case for retained proteins.
Using multiple sequence alignments for our sets of conserved lincRNAs, we also determined if the identified structural, putative miRNA binding, or sORFs were within those conserved regions. Although we did identify examples of conserved sORFs, to our surprise we did not observe strong correlation between sORF and lincRNA conservation. One particularly interesting conserved sORF is found within the lincRNA AT1G06113. The Ribo-seq identified sORF within this lincRNA is only nine amino acids long, but is almost perfectly conserved across the Brassicaceae and even in T. hassleriana. The functional significance of this peptide, as well as the other lincRNA-derived small proteins remains to be determined. In contrast to the sORFcontaining lincRNAs, the regions we identified to be protein-bound and structured were typically conserved to the same degree as the lincRNA itself. This conservation suggests these structured regions are important for function and may bind similar proteins in multiple species. Thus, identifying the protein binding partner in Arabidopsis might help provide functional insights for these lincRNAs across the family as well as develop a protein-RNA interaction database for improving functional predictions.

Using omics-approaches to assign putative function to Brassicaceae lincRNAs:
Our ultimate goal, beyond identifying lincRNAs in each of these species, was to annotate these lincRNAs so as to aid in future functional studies. We used expression data to assign lincRNAs into broad regulatory categories, such as stress-responsive, cis-regulatory, or others associated with GO-terms extracted from network analyses. As most functionally described lincRNAs to date are associated with changes in the environment (i.e., biotic/abiotic stress), our initial expectations were that most lincRNAs would be stress responsive. Interestingly, this was not the case. Roughly 10% of the lincRNAs identified in Arabidopsis and Brassica are stress-responsive, with most responding to temperature stress. While this could be linked to changes in genome-wide epigenetic control that is not specific to lincRNAs, there does appear to be a degree of response specificity.
A majority of the temperature (cold or heat) responsive lincRNAs were either specific to one stress or the other, or showed opposite responses to the two stresses. Furthermore, we also identified a set of lincRNAs whose response appears to be ABA-dependent. The preponderance of lincRNAs associated with temperature stress in our dataset may simply reflect sampling bias as our analyses were dependent on publicly available data. However, given the lincRNAs and NAT-lncRNAs that have already been functionally described as temperature responsive in Arabidopsis (Kindgren et al., 2018;Castaings et al., 2014;Zhao et al., 2018), the potential for widespread adaptation to environmental conditions by lincRNAs remains an exciting avenue for future research.

Methods: Plant materials and growth
Arabidopsis thaliana (Col-0; (Lamesch et al. 2012), Brassica rapa (R-0-18; (Howe et al. 2021), Camelina sativa (cultivar Ames), and Eutrema salsugineum (Shandong; (Yang et al. 2013) seeds were surface sterilized by washing with 70% ethanol followed by soaking in 30% bleach and 1% Tween 20 for 10 minutes before being rinsed and plated on ½ MS media supplemented with 0.5% sucrose. Plates were placed in the dark at 4°C for 5 days before being moved to a long day (16 hour light 22°C/8 hour dark 20°C) growth chamber. Ten days after germination, seedlings were either collected in liquid nitrogen or transplanted to soil and placed into the same growth chamber.
For leaf samples, leaves were either collected 4 weeks after germination, or at the mature most vegetative stage, whichever came first. Finally, for flower samples, opened flowers with no sign of developing fruit were collected. All plant samples were immediately frozen in liquid nitrogen and stored in a -80°C freezer until ready for processing. for Eutrema salsugineum Phytozome v1.0 (Yang et al. 2013). An updated annotation including newly identified lincRNAs for each species can be downloaded from the CyVerse Data Store:

RNA extraction and ONT library preparation
(DOI from CyVerse).
To identify lincRNAs with Evolinc, processed reads were aligned to each species' genome using

Analysis of DNA methylation patterns and histone modification dynamics
LincRNA and mRNA epigenetic profiles were monitored by reprocessing publicly available whole was used to remove PCR duplicates using the "lenient" setting for the "VALIDATION_STRINGENCY" option. These processed BAM files were then used as input for the deepTools "bamCompare'' function with a ChIP input sample (if available) as a comparison experiment. For Arabidopsis, paired RNA-seq experiments were used to determine which genes were expressed for plotting for WGBS and ChIP-seq experiments. For Brassica, the defined tissue atlas was used instead.

Characterization of lincRNA expression patterns
To characterize expression from ONT-seq data, Minimap2 was used to map ONT-reads to transcriptomes for each of the respective species' updated gene sets (prior annotated genes + Evolinc lincRNAs) using similar parameters as above. Minimap2 produced BAM files were used as input for Salmon in alignment-based mode, specifying the --noErrorModel option (Patro et al. 2017) Soneson et al. 2019, (Patro et al. 2017. TPM values were aggregated from each experiment using the tximport R package (Soneson, Love, and Robinson 2015) to obtain gene level expression estimates.
Specific Illumina short read datasets from Arabidopsis and Brassica were used to gain additional resolution of tissue specific expression. For Arabidopsis, the Klepikova et al., 2016 (NCBI PRJNA314076) tissue expression atlas was reprocessed, and for Brassica two datasets were combined to create a tissue atlas similar to Arabidopsis (PRJNA253868 & PRJNA185152). RNAsequencing reads (FASTQ) associated with each dataset were re-aligned to transcripts with Salmon using XXX parameters to generate transcript-level expression values (TPM). Gene level expression values were obtained as above using tximport. To calculate the tissue specificity metric τ TAU, TPM values were first averaged across replicates. TAU was then calculated as described by (Yanai et al. 2005) using quantile normalized TPM values generated from the preprocessCore R package (Bolstad n.d.). To assess tissue of maximum expression, variance stabilized transformed expression values generated from DESeq2 were utilized (Love, Huber, and Anders 2014).
The DESeq2 package (in R), with the DESeq and results functions, were used to identify differentially expressed genes in pair-wise comparisons. For time-course studies, only the first and last treatments were examined, treating each of them as separate analyses (e.g. early stress response vs. late stress response). Genes were considered to be differentially expressed if they had a log2 fold change (L2FC) greater or less than 1 or -1, respectively, as well as an adjusted p-value (qvalue, FDR) of 0.05 or lower.

Analysis of co-expression modules
To assess co-expression modules, raw RNA-seq data from select stress experiments (Supplemental File 6) was re-processed using Salmon as performed above. The tximport and DESeq2 R packages were used to import Salmon quantification files and convert expression estimates to a variety of normalized values (TPM, VST, normalized counts, etc.) The R package CEMiTool (Russo et al. 2018), v 1.10.2) was used to perform gene co-expression network analyses. In most cases, log2 + 1 converted TPM values obtained from tximport were used as input to CEMiTool. Gene ontology (GO) information was provided to CEMiTool from the biomaRt R package (Durinck et al. 2005), and protein-protein interaction data was obtained from the Arabidopsis Protein Interaction Network (Brandão, Dantas, and Silva-Filho 2009).

Measuring expression correlation of adjacent genes
Arabidopsis and Brassica expression data from the above described tissue atlases or heat experiments (Arabidopsis: PRJNA324514, Brassica: PRJNA298459) were normalized using the DESeq2 "vst" function with the "blind" parameter set to false. Additionally, for the tissue atlases, replicates for each tissue were averaged, if applicable. Genes that did not vary substantially across the input experiments were removed. This was performed by calculating the interquartile range for expression of all genes and only those in the top 50% of IQR values were retained (50% most variable genes). Pearson correlation coefficients of expression were then calculated between all remaining genes post filtering using the corrr R package (Kuhn and Wickham 2020), v 0.4.3).
Relevant correlations were then filtered for between lincRNAs and their nearest upstream and/or downstream mRNA neighbors. LincRNA-mRNA pairs separated by fewer than 100 base pairs were removed before subsequent analyses. Random gene pairs were generated from all pairwise correlations using the slice_sample function from the dplyr R package.
To analyze all gene pairs within defined distances, the bedmap function from the BEDOPS suite (Neph et al. 2012) was used with the range, echo, and echo-map-id options. This generated all lincRNA-mRNA or mRNA-mRNA pairs within 200, 500, 1000, 2000, 5000, and 10000 base pairs of each other. These gene pairs were used to filter out the pairwise correlations generated above.

Assessing lincRNA function from multi-omics hypoxia datasets
The multi-omics datasets generated by Lee and Bailey-Serres (2019) were re-processed using the above methods. After processing ChIP-seq data as above, the output BAM files were used as input for HOMER motif analysis (Heinz et al. 2010) following along closely with the provided "Next-Generation Sequencing Analysis" tutorial provided by HOMER. Differentially expressed genes were generated using a basic design strategy in DESeq2. Briefly, the early hypoxia stress was compared to the early control samples (2 hour hypoxia vs 2 hour control using the contrasts option with the results command from DESeq2), and the later hypoxia stress was compared to the later control experiments. Differentially expressed genes for the reoxygenation experiments were not analyzed, but the expression data was used for constructing the DESeq data set and running the differential expression analysis. For identifying relevant co-expression modules including lincRNAs that may be involved in the hypoxia response, CEMiTool was used as above using log transformed normalized counts from DESeq2.
The sORFs were considered translated if they displayed significant 3-nucleotide periodicity and the translated ones were extracted from the RiboTaper output ORF_max_filt file.
To identify lincRNAs harboring putative sORFs based on mass spectronomy data, proteomic experiments, PXD026713 and PXD009714, were retrieved from the PRIDE repository.
Raw chromatograms were analyzed using MaxQuant software (Version 1.6.0.16) with Andromeda-an integrated peptide search engine (Cox at al., 2011). Following search settings were applied: a maximum of two missed cleavages was allowed, and the threshold for peptide validation was set to 0.01 using a decoy database. In addition, methionine oxidation and N-terminal acetylation were considered variable modifications, while cysteine carbamidomethylation was a fixed modification. The minimum length of a peptide was set to at least seven amino acids.
Moreover, label-free protein quantification (LFQ) was applied. Peptides were identified using the Araport 11 database (The Arabidopsis Information Resource, www.Arabidopsis.org) and a library of all Arabidopsis lincRNA ORFs (positive strand) obtained using Transdecoder.

Evolutionary analyses
LincRNA sequence homologs were identified using the Evolinc-II module ( Cheng et al., 2013). For each of the four species, the entire lincRNA list (LC + HC) were included as query in the analyses. LincRNAs were determined to be restricted to a particular node if no sequence homolog was identified in a more distantly related species. LincRNAs were determined to be conserved as lincRNAs or mRNAs in other species if they overlapped by 50% or more with an annotated gene on the same strand. If not, they were considered to be unannotated. Multiple sequence alignments produced by Evolinc-II (using MAFFT) were imported into Geneious (Genious Prime 2021.1.1, https://www.geneious.com) for downstream structure, sORF, and miRNA motif analysis.
Transcriptional syntelogs were identified by downloading the DAGChainer output, with genomic coordinates, from pairwise SynMap analyses between Arabidopsis and each of the three other species (links to regenerate analyses: Camelina https://genomevolution.org/r/1fjg7, Eutrema https://genomevolution.org/r/1f7si, and Brassica https://genomevolution.org/r/1f79g). LincRNAs that were found within syntenic blocks (10 colinear protein-coding genes), between orthologous genes in either of the pairwise SynMap analyses, and in the same orientation to at least one of the neighboring orthologous genes were considered transcriptional syntelogs. To infer lincRNA gene family contraction or expansion, a rudimentary ancestral state reconstruction was performed. For Arabidopsis, ancestral gene copy number for each Arabidopsis lincRNA was inferred by averaging the number of recovered sequence homologs in (at minimum) A. lyrata, C. rubella, and C.
grandiflora, A. thaliana, and A. lyrata were used to determine the copy number in the last common ancestor. This value was then multiplied by three (to account for the Camelina-specific whole genome triplication event). Values above or below this value were considered to be expansions or contractions, respectively. A similar approach was performed for Brassica and Eutrema.
MSAs were manually scanned to infer depth of conservation of sORFs, putative miRNA binding motifs, and structural/protein-binding elements. On top of lincRNA sequence homology and synteny requirements, for a sORF to be considered conserved, the start and stop sites within the annotated Arabidopsis lincRNA must be positionally conserved (within +/-three AA). In addition, the translated amino acid sequence must be 75% identical in pairwise alignments between Arabidopsis and each putative homologous sORF. To identify putative miRNA binding sites, all lincRNAs were scanned for motifs using psRNATarget (Dai et al., 2018) using an expectation score of 2.5 as cutoff. LincRNAs with putative miRNA binding motifs were then compared against the list of lincRNAs that were conserved outside of Arabidopsis. MSAs were then scanned for the presence of miRNA motifs. Motifs with complete coverage and no more than two (pairwise) mismatches in at least one other species were considered for evolutionary comparisons. For conservation of structural/protein-binding motifs, structured regions inferred by PIP-seq (GEOs GSE58974 and GSE86459; Gosai et al., 2015 andFoley et al., 2017) were intersected with lincRNAs using Bedtools intersect (Quinlan et al., 2010). Arabidopsis lincRNAs, their sequence homologs (from Evolinc-II) and structured regions were combined into a MSA using MAFFT (Nakamura et al., 2018) for manual inspection. PIP-seq motifs were considered conserved if the entire motif was contained within an alignable region of a sequence homolog from another species.
For a motif (sORF, miRNA, or structural) to be considered conserved to a particular node, at least one species that shares that node with Arabidopsis was required to contain those motifs under the parameters described above.              Node at which lincRNA sequence is conserved Node at which sORF is conserved