Novel long non-coding RNAs of relevance for ulcerative colitis pathogenesis

Background and aims The study aimed to identify yet unknown and uncharacterized long non-coding RNAs (lncRNAs) in treatment-naïve ulcerative colitis (UC), and to define their possible roles in UC pathogenesis. For that purpose, accurate quantification methods for lncRNA transcript detection, multiple and “stringent” strategies were applied. New insights in the regulation of functional genes and pathways of relevance for UC through expression of lncRNAs are expected. Methods The study was based on sequencing data derived from a data set consisting of treatment-naïve UC patients (n = 14) and control subjects (n = 16). Two complementary aligners were used to identify lncRNAs. Several different steps were used to validate differential expression including plotting the reads over the annotation for manual inspection. To help determine potential lncRNA involvement in biological processes, KEGG pathway enrichment was done on protein-coding genes which co-expressed with the lncRNAs. Results A total of 99 lncRNAs were identified in UC. The lncRNAs which were not previously characterized (n = 15) in UC or other autoimmune diseases were selected for down-stream analysis. In total, 602 protein-coding genes correlated with the uncharacterized lncRNAs. KEGG pathway enrichment analysis revealed involvement of lncRNAs in two significantly enriched pathways, lipid and atherosclerosis, and T-cell receptor signaling. Conclusion This study identified a set of 15 yet uncharacterized lncRNAs which may be of importance for UC pathogenesis. These lncRNAs may serve as potential diagnostic biomarkers and might be of use for the development of UC treatment strategies in the future.


Background
Ulcerative colitis (UC) is a chronic inflamed condition of the colon and rectum and one of the major phenotypes of inflammatory bowel disease (IBD) [1]. Despite the prevalence of UC, the etiology of UC is poorly understood. The UC pathogenesis is complex and an interplay between environmental factors, intestinal microbiome, nutrition and genetic factors [1]. Although heritability plays a potential role, only a small fraction (7.5-22%) of UC risk can be explained by genetic factors alone [2,3]. Genome-wide association studies (GWAS) found several IBD risk loci on the non-coding region of the genome [4]. LncRNAs have not been thoroughly explored in IBD [5] nor has their contribution to the progression of the disease.
LncRNAs play an important role in tumor development and carcinogenesis and have been suggested to be biomarkers for diagnosis and prognosis [6][7][8]. A growing body of evidence implies a role for lncRNAs in UC [9][10][11]. The expression of lncRNAs in UC has previously been reported [9,10,12,13]. They are involved in the modulation of the intestinal barrier function [13,14], regulating expression of inflammatory cytokines [15], and polarization of macrophages [16].
LncRNAs, which are RNAs with a length greater than 200 nucleotides, are poorly conserved [17]. Their roles in gene expression regulation are still not well understood [18]. They may or may not be polyadenylated, and 98% are spliced. At least two different alternatives spliced isoforms have been observed in about 25% of all known lncRNAs [19]. LncRNAs share common features as they are expressed at lower levels, are tissue-specific, and have exonic regions with low levels of interspecies sequence conservation [20]. Weak expression makes accurate quantification of lncRNA transcripts particularly challenging. According to ENCODE's own evaluation, less than 1000 lncRNAs are present at greater than one copy per cell in the typical human tissue culture cell lines [21]. In addition, many lncRNA exons overlap protein-coding exons on the same strand making it difficult to determine the origin of the transcript counts. To ensure the veracity of differentially expressed lncRNAs, several complementary methods need to be employed. Determining lncRNA function is difficult, but protein-coding transcripts that co-express with lncRNA transcripts may offer some insight into lncRNA function. Likewise, pathway enrichment of co-expressed protein-coding genes may offer insight into relevant biological pathways involved in UC pathogenesis.

Patient data
Gene expression data of mucosal gene expression were obtained from the Gene expression Omnibus (GEO) GSE128682 and represent sequencing data obtained from mucosal biopsies of treatment-naïve UC patients (n = 14) and normal control subjects (n = 16) [22].

Data analysis
A schematic overview of the data analysis methodological approach is shown in Fig. 1. The Gencode v36 (GRCh38.p13) reference genome (https://www.gencodegenes.org/) [23] was used for all alignments, annotations and visualization methods. All tests for differential expression were between UC and normal samples. Both Star aligner and Kallisto were used to align the Illumina generated fastq sequences. Star was used to generate a gene count matrix. Kallisto was used to create a transcript count matrix. DESeq2 was used to find DE genes and DE transcripts with an adjusted p value less than 0.05. LncRNAs were defined as those with transcript type or gene type equals lncRNA in the annotation gtf file. Stringtie v2.0.3 (https://github.com/gpertea /stringtie/releases) [24] was used to create a consensus set of transcripts from the Star aligned bam files. The Ballgown (https://www.bioc onductor.org/packages/release/bioc/html/ballgown.html) [25] stattest using transcript FPKM as a metric was used to generate transcript q values from the set of Stringtie consensus transcripts. Granges is a software package that can identify genomic overlaps (https://bioco nductor.org/packages/release/bioc/html/GenomicRanges.html) [26]. Granges was used to isolate lncRNA exons that did not overlap with known protein coding exons on the same strand. A matrix of Ballgown unique exon counts was created from the non-overlapping lncRNA exons. DESeq2 was used to identify differentially expressed exons from the exon matrix, adjusted p value < 0.05. Differentially expressed lncRNA met the following conditions: Star(gene) padj <0.05, Kallisto (transcript) padj <0.05, Ballgown (FPKM) qvalue <0.05, and at least one non protein overlapping lncRNA exon with padj <0.05. Only lncRNA transcripts with an average read count greater than 16 were considered.
LncRNA annotation is constantly updated, therefore Biomart was used to check that the remaining differentially expressed lncRNA transcript type was currently annotated as lncRNA (https://bioconductor.or g/packages/release/bioc/html/biomaRt.html). in latest Ensembl annotation. Finally, each significantly lncRNA was inspected visually (Fig. 2). By using Samtools [27] the read coverage for each candidate lncRNA region was extracted directly from the STAR aligned Bam files. LncRNA transcript read coverage was plotted over the genome reference exon structure using the Gviz [28] package. LncRNAs whose read coverage peaks aligned with reference exons, was greater than the local Fig. 1. Flow diagram representing the outline of experimental steps. Fastq data was aligned using several methods Star, Kallisto, and Ballgown. Differential expression was estimated by DESeq2 for Star (gene counts) and Kallisto (transcript counts), stattest for Ballgown (FPKM). LncRNA candidates were significantly differentially expressed in all three tests. GRanges was used to find non protein overlapping lncRNA exons. Ballgown unique exon counts and DESeq2 were used to ensure that candidates had at least one differentially expressed non protein overlapping exon. Bam read counts were then plotted over genome annotation to ensure exon read count alignment to annotation and comparison to background noise.
background, and did not completely overlap non lncRNA reference exons ( Fig. 2) were considered as candidates.
Co-expression analysis was used to select potential protein coding target transcripts for the fifteen uncharacterized lncRNAs depicted in Table 1. Transcripts with an absolute Pearson correlation greater than 0.85 were selected. If total lncRNA transcript targets were less than thirty, the top thirty most co-related transcripts were taken into consideration. The R ReactomePA package (https://bioconductor.org/ packages/release/bioc/html/ReactomePA.html) [29] was used to find significantly enriched KEGG [30] using gene names of the co-expressed uncharacterized lncRNA transcripts.

Differentially expressed lncRNAs in treatment-naïve ulcerative colitis
DESeq2 on the STAR generated gene count matrix gave a total of 8615 differentially expressed (DE) lncRNA genes (padj <0.05) coding for a total of 28182 lncRNA transcripts ( Fig. 1). Ballgown stattest (q value < 0.05) using FPKM values from Stringtie consensus sequences gave a total of 5614 DE lncRNA transcripts belonging to 4254 lncRNA genes. DESeq2 was used to perform differential expression analysis on transcripts obtained from the Kallisto aligner, which gave 9599 DE lncRNAs transcripts (padj <0.05) belonging to 6720 lncRNA genes.   Table 1). The entire analysis flowchart is shown in Fig. 1. Visual inspection of lncRNAs candidates was done by plotting read coverage of lncRNAs over the exon structure defined in the genome reference annotation (Fig. 2). A total of 116 lncRNAs transcripts representing 99 lncRNA candidate genes were selected ( Supplementary  Fig. 1). Seven of the candidate lncRNAs were previously found to be dysregulated in IBD, fourteen have been observed in colorectal cancer, and six were related to inflammation and infection (Supplementary Table 2). All these 99 significantly differentially expressed lncRNAs are depicted in a heat map (Supplementary Fig. 2). Among these 99, fifteen lncRNAs have not been previously described and characterized in UC (Table 1). Principal component analysis (PCA) using the uncharacterized lncRNAs showed a clear separation between UC samples and normal samples. Principal component 1 (PC1) explained 64.1% of the total variance (Fig. 3A).

Co-expression of lncRNAs with protein-coding genes
The 15 uncharacterized lncRNAs were then subjected to correlation analysis, which resulted in a total of 602 co-expressed protein-coding genes in correlation analysis (coefficient absolute 0.85 with correlation p-value < 0.05) (Supplementary Table 3). In addition, a PCA was performed on the differentially expressed protein-coding transcripts (n = 686) which correlated with the expression of the uncharacterized lncRNAs (Fig. 3B). Here, principal component (PC1) explained 74.6% of the total variance and a clear separation of UC and normal samples was seen. LncRNAs AC110611.2, GOARB-AS1, AC005550.2, and AC116345.4 were co-expressed with 190, 170,112, and 65 proteincoding transcripts, respectively. Correlation analysis showed that multiple protein-coding transcripts can co-express with a single lncRNA transcript and vice versa (Supplemental Table 3). Among the coexpressed transcripts were several protein-coding genes which related to inflammation and UC progression like interleukin 1B (IL-1B) [31], metalloproteinase 3 (MMP3), metalloproteinase 9 (MMP9) [32], and Vav guanine nucleotide exchange factor 3 (VAV3) [33]. Several pro-inflammatory cytokines such as IL-33, TNFSF10 and IL21R co-expressed with AC110611.2.

Pathway enrichment analysis
Genes corresponding to the correlated protein-coding transcripts were used for KEGG pathway enrichment. Two significantly enriched pathways with padj and qvalue <0.05 could be identified, the T cell receptor pathway and the lipid and atherosclerosis pathway. Seventeen and twelve genes, which co-expressed with the uncharacterized lncRNAs were found to be enriched in both pathways. Among them VAV3, lymphocyte cytosolic protein 2 (LCP2), and inducible T cell costimulator (ICOS), both of which play a role in vascular endothelial cell integrity [34], NK-cell mediated recognition of missing-self targets [35], and effective T-helper-cell responses [36]. To illustrate the correlations an example of a co-expression is shown in Fig. 4.

Discussion
In this study, differentially expressed lncRNAs in treatment-naïve UC were explored by applying accurate quantification methods for lncRNA transcript detection. This study provides new knowledge of 15 previously uncharacterized lncRNAs which may be involvedin the regulation of the lipid and atherosclerosis and T cell receptor signaling pathways.
Accurate quantification of lncRNA transcripts is challenging. Therefore, several complementary methods along with visual inspection were applied to generate a set of lncRNAs that distinguish between UC samples and controls (Fig. 1). The majority of lncRNA transcripts are expressed at a significantly lower level than protein coding transcripts, making lncRNA transcription levels difficult to distinguish from the background noise [37]. Recent RNAseq studies have shown differences in intra-exonal coverage, which could have aroused from naturally occurring splice variants sharing part of an exon or could have been due to technical errors in library construction or sequencing [38]. In addition, some lncRNA's exons overlap with other non-lncRNA exons, making it difficult to determine the origin of read counts [39]. Lower counts and overlaps present challenges for lncRNA quantification. Therefore, only lncRNAs containing at least one differentially expressed lncRNA exon that did not overlap a protein coding exon were considered for this study. An example is given in Fig. 2, showing the lncRNA myocardial infarction associated transcript (MIAT). The MIAT read counts map well to the MIAT lncRNA exon annotation and aligns to a greater extent than protein-coding exons. This suggests that the majority of read counts come from MIAT exons and not any protein coding exon overlaps. The MIAT read counts in Fig. 2 are greater than the local background. Plotting the read counts over the annotation strengthened the ability to quantify lncRNA accurately.
Initial PCAs of the uncharacterized lncRNAs (Fig. 3A) and the corresponding correlated protein-coding transcripts (Fig. 3B) revealed a clear separation of UC samples from normal samples in both cases. This indicates that the chosen sample size is satisfactory to make assumptions  14; red)) and normal controls (n = 16; green). The first two components explain 64.1% and 8.3% of the variability in the lncRNA expression data. (B) PCA of differentially expressed coding transcripts (n = 686) which correlate to the uncharacterized lncRNAs (n = 15) presenting the difference between UC (n = 14; red) and normal controls (n = 16; green). The first two components explain 74,6% and 5.6% of the variability in the expression data. on the significance of the results. KEGG enrichment analysis of uncharacterized lnc transcripts correlating with protein coding genes revealed in only two significantly enriched pathways, the lipid and atherosclerosis pathway, and the T cell receptor signaling pathway. The correlation plots for the two pathways are depicted in Fig. 4.
In the lipid and atherosclerosis pathway, lncRNA AC001611.2 expression correlated positively with four genes (MMP3, MMP9, IL-1 and CXCL3) whereas six other lncRNAs correlated negatively with the same genes (Fig. 4A). Perhaps these lncRNAs are involved in the modulation of inflammatory cytokines production, and immune cells migration during UC by regulating the expression of matrix metallopeptidases. A connection between impaired intestinal integrity, cytokine production, and monocytes migration has been reported to be associated with atherosclerosis [40][41][42]. A relationship between UC and atherosclerosis has been implicated [43][44][45][46][47]. The reported higher risk of cardiovascular events in UC patients may be pertinent in inflammation-mediated atherosclerosis [48][49][50] as inflammation and atherosclerosis have been proposed to share similar pathogenesis [51]. Therefore, the identified and previously unknown lncRNAs might qualify for possible new prognostic factors for UC patients with atherosclerosis.
LncRNAs may also play a role in T cell apoptosis during UC. LncRNAs AL354743.2 and LINC0113 correlated positively with the STAT3 transcription factor which induces the transcription of BCL2 and BCL-XL in T cells. The expression of these anti-apoptotic genes can increase the resistance of pathogenic T cells of lamina propria to apoptosis, leading to prolonged inflammation [52].
The T cell receptor-signaling pathway was the second significantly enriched pathway. Several genes involved in this pathway such as PTPRC (CD45), NFATc1, and RASGRP1 were differentially expressed in UC (Fig. 4B). The expression of PTPRC (CD45), a known IBD susceptibility gene, correlated positively with lncRNA AC110611.2 and correlated negatively with six lncRNAs depicted in Fig. 4B. Here, the lncRNAs might contribute to the activation of Cd4+ T cells which are key players in mediating the host protective and homeostatic responses to inflammation [53]. It is interesting to note that these lncRNAs might also play a role in the regulation of the expression of different patterns of alternatively spliced CD45 isoforms that have been shown to be associated with distinct functions [54]. T cell activation of cytokine production is also regulated by the expression of NFATc1 and RASGRP1 both of which correlated positively with lncRNA AC110611.2 and correlated negatively correlated with lncRNAs AC007255.1, AL353572.4, THRB-AS1, AL121790.2, and AC116345. Interestingly, RASGRP1 promotes inflammatory responses by enhancing the production of IL-6 by sponging with miRNA let-7a [55]. IL-6 has been shown to be positively associated with UC development and regulates intestinal barrier function via STAT3 [56].
Many of the up-regulated lncRNAs found by this methodological approach have been already identified and shown to be associated with IBD pathogenesis (Supplementary Table 2), such as small integral membrane protein 25 (SMIM25) [60], IFNG antisense RNA 1 (IFN-G-AS1) [61], and DIO3 opposite strand RNA (DIO3OS) [62]. The observed downregulation of CDKN2B-AS1 is negatively correlated with inflammatory cytokines expression responsible for UC progression [12]. The upregulation of LINC01871 might indicate a dysregulation of T cell inflammatory responses in UC as has been reported for several other autoimmune diseases [63,64]. Overall, our study gives an insight into novel lncRNAs which potentially be involved in intestinal barrier function and immune cell development, activation, and migration. However, loss-and gain-of-function studies are required to verify the biological importance of expression of these lncRNA by in vitro and in vivo experiments. To what extent the uncharacterized lncRNAs contribute to the regulation of the T-cell receptor signaling pathway during UC progression has to be explored in more depth in the future.

Conclusion
This study revealed 15 lncRNAs, which have not been functionally annotated previously and which may be involved in the pathogenesis of UC. The applied methodological approaches together with a visual inspection of read counts over the annotation was key to identifying lncRNA's that were differentially regulated. The results may provide new potential diagnostic biomarkers and therapeutic targets for ulcerative colitis which may improve the understanding of the molecular pathogenesis of UC. However, if lncRNAs are going to be of use as future biomarkers for UC, more reliable approaches for lncRNAs profiling and reliable lncRNA quantification methods are required.

Ethical approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data
All data are within the manuscript and its Supporting Information files.

Declaration of competing interest
The authors declare no conflict of interests regarding the publication of this manuscript.