Identification of the conserved long non-coding RNAs in myogenesis

Our understanding of genome regulation is ever-evolving with the continuous discovery of new modes of gene regulation, and transcriptomic studies of mammalian genomes have revealed the presence of a considerable population of non-coding RNA molecules among the transcripts expressed. One such non-coding RNA molecule is long non-coding RNA (lncRNA). However, the function of lncRNAs in gene regulation is not well understood; moreover, finding conserved lncRNA across species is a challenging task. Therefore, we propose a novel approach to identify conserved lncRNAs and functionally annotate these molecules. In this study, we exploited existing myogenic transcriptome data and identified conserved lncRNAs in mice and humans. We identified the lncRNAs expressing differentially between the early and later stages of muscle development. Differential expression of these lncRNAs was confirmed experimentally in cultured mouse muscle C2C12 cells. We utilized the three-dimensional architecture of the genome and identified topologically associated domains for these lncRNAs. Additionally, we correlated the expression of genes in domains for functional annotation of these trans-lncRNAs in myogenesis. Using this approach, we identified conserved lncRNAs in myogenesis and functionally annotated them. With this novel approach, we identified the conserved lncRNAs in myogenesis in humans and mice and functionally annotated them. The method identified a large number of lncRNAs are involved in myogenesis. Further studies are required to investigate the reason for the conservation of the lncRNAs in human and mouse while their sequences are dissimilar. Our approach can be used to identify novel lncRNAs conserved in different species and functionally annotated them.


Background
Recent transcriptomic studies of mammalian genomes have revealed the presence of a substantial population of non-coding RNA (ncRNA) molecules among the transcripts expressed in cells. More than 90% of the human genome encodes ncRNAs [1][2][3], and the presence of such a large collection of ncRNAs indicates the regulatory potential of these molecules [4][5][6] . Based on size, ncRNAs are grouped into two classes: short ncRNAs and long ncRNAs. Short ncRNAs, fewer than 200 bp in length, include microRNAs or piwi-interacting RNAs; long ncRNAs (lncRNAs) are greater than 200 nucleotides and transcribed mostly by RNA polymerase II. Similar to messenger RNAs, lncRNAs contain a 5′7-methylguanosine cap and a 3′ poly(A) tail; however, lncRNAs lack coding potential. This new class of genes has recently been identified in various tissues [7][8][9][10]. Although the functions of microRNAs are well studied [11], the mode of action of lncRNAs in gene regulation is not well understood. Previous studies in X-chromosomal dosage compensation underscore the regulatory potential of lncRNAs, whereby the mechanism is carried out via concerted action of the lncRNA Xist and protein complexes [12,13]. Recent studies have revealed the involvement of lncRNAs in Drosophila dosage compensation. This dosage compensation system employs two lncRNAs (roX1 and roX2), which are essential for other proteins to form the Male-Specific lethal complex and for targeting of the complex to hundreds of distinct sites on the X chromosome in male fruit flies [12,[14][15][16][17]. Recent studies provide evidence that lncRNAs play important roles in normal physiology and many diseases [6], including embryonic stem cell maintenance, differentiation and development [18], the antiviral response [19], gene imprinting [20], and cancer progression, as well as vernalization in plants [21]. Furthermore, the ENCODE project (GENCODE v26) has annotated thousands of lncRNAs in various cells [6], though further studies are required for functional annotation of these lncRNAs.
In addition, evidence for the involvement of lncRNAs in embryonic or adult skeletal myogenesis and muscle diseases is growing [22][23][24][25][26][27]. Therefore, we selected the process of myogenesis as a case study to identify lncRNAs from large transcriptome data in mice and humans and annotated the functional roles played by lncRNAs in skeletal myogenesis. We determined differentially expressed lncRNAs in myoblasts and myotubes and confirmed expression with epigenetic marks, such as histone modifications. Additionally, we determined conserved lncRNAs by investigating the shared synteny of the lncRNA with nearby genes in both mouse and human. We further functionally characterized the identified lncRNAs based on their association with the genes in their vicinity. In general, lack of sequence homology and conserved secondary structure of these lncRNAs make the functional annotation a challenging task [28][29][30], and there have been many previous attempts at functionally annotating lncRNAs. In some cases, the function of lncRNAs has been inferred by exploring relationships between lncRNAs and nearby proteincoding genes [31], and some roles have been predicted by identifying coding genes co-expressed with lncRNAs [32,33]. We obtained the structures of the identified lncRNAs from Conserved-RNA Structure (CRS) database [34]. Some of these lncRNAs show moderate structural conservation, which also indicates a common role in mice and humans. Subsequently, we functionally characterized lncRNAs by examining the gene ontology of neighbouring genes, as well as by investigating the ontologies of genes in close vicinity in three-dimensional space. Some of these identified lncRNAs were experimentally validated in C2C12 cells, and the results revealed that the computationally identified lncRNAs are indeed differentially expressed in these cells.

Results
The objective of this study was to identify conserved lncRNAs between humans and mice. Hence, we first identified lncRNAs present in mice and correlated their expression with nearby genes, epigenetic marks and histone modifications. The expression of a few identified lncRNAs was experimentally confirmed. The lncRNAs identified from mice were compared with human datasets to identify conserved RNAs. Finally, the functional role of these lncRNAs was assessed by overlapping them with topologically associated domains and investigating the function of the genes in these domains.

Identification of lncRNAs involved in mouse myogenesis
To identify lncRNAs in the mouse skeletal muscle system, we used Trapnell et al.'s C2C12 myoblast and early myotube (3 days after differentiation) deep RNA sequencing (RNA-Seq) data [35]. The reads from the dataset were aligned and mapped to the mouse genome (version mm10). A total of 55,874 transcripts were identified. Protein-coding genes were excluded from this analysis. Transcripts of > 200 bp with no coding potential were selected as lncRNAs. The filtered lncRNAs were annotated by using a mouse genome annotation file. We selected lncRNAs that were temporally regulated during myoblast differentiation, as these lncRNAs may have a role or assist in myogenic differentiation. Significant lncRNAs were selected based on Log2 fold change 1 and False Discovery Rate (FDR) < =0.05, identifying 2059 differentially expressed lncRNAs in the dataset. Among the identified lncRNAs, many have been previously shown to be expressed in C2C12 cells and involved in muscle development and differentiation. We detected expression of known lncRNAs, such as NEAT1, H19, MALAT1, Linc-MD1, MYH, MUNC, Lnc-31HG, LncMyoD, SRA1, and RPL12P8, in the myotube stage, corroborating earlier studies [22,[36][37][38]. Linc-MD1, LncMyoD, Malat, and SRA1 are involved in myoblast differentiation, whereas Lnc-31HG and RPL12P8 play a significant role in myoblast proliferation. In addition to these known lncRNAs, we identified 57 conserved lncRNAs in this dataset (Table 1 and Additional file 1: Table S1). Annotation of some of these lncRNAs were found in FANTOM database [5] and these includes enhancers and promoter lncRNAs. The logCPM values derived from RNA-Seq data by Trapnell et al. were compared with the gene expression data from Liu et al. [39] and found to be highly correlated (correlation coefficient = 0.67 and p-value < 2.2e- 16), indicating a consensus between these studies.

Expression pattern of lncRNAs and nearby genes
We observed some lncRNAs are highly expressed in the myoblast stage and decrease expression in the myotube stage; and some highly expressed in the myotube and have low expression in the myoblast stage, which is termed as myoblast-specific and myotube-specific lncRNAs, respectively. Read densities of the lncRNAs in myoblast and myotubes revealed that myotube-specific lncRNAs begin to be expressed at the myoblast stage and that levels increase during the myotube stage. We investigated nearby genes to determine the possible targets of the lncRNAs. Previous studies considered genes within 10 kb as candidate targets [40], and we observed a similar pattern with nearby genes (within the 10 kb region). Comparison between the level of myoblastspecific genes during the myoblast stage and myotubespecific genes in the myotube stage showed that the latter are expressed at a higher level than the former. Moreover, expression of myoblast-specific genes and lncRNAs decreases at the myotube stage ( Fig. 1). In human dataset also, we observed a similar behaviour, highly expressed lncRNAs and genes in myoblast stage decreases at myotube stage. Genes which were highly expressed in myotube, started their expression in early stage and increases in later stage (Additional file 2: Figure S1). In mouse, we did not observe a higher expression of myoblast-specific genes and lncRNAs in myoblasts because we considered the later stage of myoblasts. At this stage, myoblast-specific gene expression is destined for silencing, and myotube-specific genes are triggered for expression.
We compared the observed expression pattern with the distribution of RNA polymerase II (PolII) (Fig. 1 e and f) using the PolII binding profile obtained from Asp et al.'s study [41]. As PolII is involved in lncRNAs expression, we investigated the distribution of PolII at the TSS of selected lncRNAs and nearby genes and observed a similar pattern of distribution. Specifically, we found considerable enrichment of PolII on myotube-specific genes in myoblasts (Fig. 1e), suggesting that these gene regions had already been converted to active chromatin.

Comparative analysis and identification of conserved lncRNAs in mouse and human
In mammals, muscle development occurs through distinct myogenic waves and is evolutionarily conserved. Moreover, transcription factors responsible for the commitment of mesodermal cells to a muscle lineage and the initiation and maintenance of the terminal differentiation programme are highly conserved in mammals [42]. To identify lncRNAs conserved between mice and humans, we matched the lncRNAs identified from mice with those in humans using Zeng et al.'s RNA-Seq data [43], which are comprehensive single-cell and single-nucleus RNA sequencing data generated to study gene expression profiles in undifferentiated myoblasts and myotubes (72 h after induction of differentiation) in Hu5/KD3 (KD3) cells. Pairwise sequence comparison of the lncRNAs from humans and mice revealed very weak conservation (Additional file 3: Figure S2). Therefore, to find conserved lncRNAs, we first selected the neighbouring genes (upstream and downstream) of the mouse lncRNAs as a reference point (Fig. 2). These genes were delineated in the human genome, after which we investigated whether any lncRNAs are located near these reference genes in the human dataset. Thus, we identified common lncRNAs in mice and humans based on the reference genes. To re-verify the sequence conserveness, we have annotated the lncRNA conserved sequence alignment information among 100 vertebrates species by using the MULTIZ alignment program provided in the RNA-Central database(v14, [44]). While six of the identified lncRNAs (RP11-887P2.5, RP11-366 L20.2, LINC MD1, CARMN, AC007383.3, MALAT1) were highly conserved across species (mean phastcon score ranges from 0.80 to 0.99), most of them showed moderate to poor conservation (Additional file 4: Table S2). Further, the consensus structure of the lncRNAs was built by using CRS database (Additional file 4: Table S2 and Additional file 5: Figure S3). The database holds the information of vertebrate genomes for conserved RNA structures and consensus structure was built based on the CMfinder program using the expectationmaximization algorithm using covariance models [34]. Some of these lncRNAs showed consensus secondary structure, which implies that they may have some common role to play in myogenesis.
While analysing these common lncRNAs, we found some known common lncRNAs, such as NEAT1, XIST, Table 1 The table shows the number of lncRNAs conserved between humans and mice. Among the 57 RNAs, 15 are lncRNAs, and the remainder are e-lncRNAs and p-lncRNAs annotated by FANTOM database. The function column shows the gene ontology of the genes associated with lncRNAs in the TAD. The functional annotation reveals that all of the identified lncRNAs are involved in developmental processes. However, some of the lncRNAs are involved in muscle development, and a few are involved in chromatin organization and H19. We also noticed that the synteny around the lncRNAs is conserved between mice and humans. Overall, we identified 57 conserved lncRNAs (Table 1). Of these 57 lncRNAs, there were only 13 with nearby genes within the 10-kb window, and the remaining 44 were in gene desert regions. The conserved location and moderate structural conservation may indicate a common role of these lncRNAs in both the organism. Because a large number of RNAs are located in gene desert regions, we examined whether they are in enhancers. To determine the enhancer's property, we utilized the H3K27Ac ChIP-seq profiles from the Bernstein lab of the Broad Institute's Human Genome Project [2].  In humans, there can be two scenarios. In b), the synteny remains the same; however, in c, the gene arrangements were changed We assessed the level of H3K27Ac in skeletal muscle myoblasts in humans, as well as H3K27Ac levels in other cell lines, such as GM12878, H1-hESC, and K562 cells. This analysis revealed that many of the RNAs are located in enhancer regions. We further investigated whether the lncRNAs overlapped with the regulatory elements within 5 kb region by integrating CTCF binding sites, promoters, proximal enhancers, distal enhancers from ENCODE for both human and mouse genome. Significant amount of overlapped was found for enhancers regions (proximal and distal enhancers) compared to promoters and CTCF binding sites (Additional file 6: Figure S4). The complete lists of genomic position of lncRNAs with each of the regulatory elements for human and mouse datasets provided as supporting information (Additional file 7). However, we also found that a few RNAs are located in gene desert regions that do not carry the enhancerspecific marker H3K27Ac. One reason for the lack of H3K27Ac marks may be that we have yet to detect the deposition of H3K27Ac marks in C2C12 or KD3 cells or other cell lines; another reason may be that these sites are not typical/canonical enhancers. These sites are distal regions in one-dimensional space but may be closer in three-dimensional space. Moreover, we detected multiple possible lncRNAs in humans for only a few lncRNAs in mice, though we selected only one among the multiple hits based on the distance and log-fold change as well as the expression level. The sequence of these common lncRNAs is not conserved; however, as they are expressed in both mice and humans, they likely have an important role in the structural conformation of the genome during differentiation.

Correlation of the lncRNA expression pattern with epigenetic factors
Because the regulation of gene expression during lineage commitment and differentiation is controlled by dynamic changes in chromatin, we investigated histone modifications that play an essential role in chromatin architecture. To this end, we examined the histone modification profiles obtained by Asp et al.  peaked around the TSS for both nearby genes and lncRNAs ( Fig. 3 a-h). In myotubes, these levels decreased around the TSS. However, a similar pattern was not observed for the H3K18Ac mark ( Fig. 3 i-l): unlike H3K9Ac and H4K12Ac, we did not observe sharp peaks around the TSS for H3K18Ac, and the level did not decrease in myotubes. H3K18Ac deposition on myoblast-specific genes remains the same but increases slightly on myotube-specific genes. Asp    Green lines indicate myoblast-specific lncRNAs and nearby genes. Orange lines represent myotube-specific lncRNAs and nearby genes. Indigo lines indicate gene having no expression. Total number of 500 nearby genes taken into consideration selected in this study are not constitutively expressed. As observed by Asp et al., we also found that the distribution of H3K18Ac was not restricted to regions surrounding the TSS (Fig. 3 i-l). While investigating the repressive marker H3K27me3, we observed a low level on genes expressed during myogenesis (Fig. 4 a-d). In contrast, non-expressed genes exhibited a higher level of H3K27me3 deposition. Supporting this observation, we did not detect accumulation of PolII at these non-expressing genes ( Fig. 1 e and f).
Genome wide mapping of histone deacetylase (HDAC) and Histone acetyltransferases (HATs) in human genome indicated that H3K4 methylation primes chromatin to facilitate histone acetylation and H3K36me2/3 facilitates deacetylation slows elongation [45]. We observed low deposition of H3K36me3 at the TSS of both lncRNAs and nearby genes (Fig. 4 e-h), suggesting that these genes were expressed. Although the level of H3K36me3 remained the same for myotube-specific genes, the level decreased for myoblast-specific genes. As expected, the level of activation marker H3K4 methylation was high at the promoters and gene bodies of active genes (Fig. 5). We observed similar distribution pattern of histones marks in human dataset (Additional file 8: Figure S5, Additional file 9: Figure S6). We overlapped the modified histone marks within 5 kb region of the lncRNA for both mouse and human genome. We observed that the distribution pattern of the modified histone marks in mouse and human are conserved (Additional file 10: Figure S7 and Additional file 11).

Quantitation of lncRNAs in myoblasts and myotubules
We cultured C2C12 skeletal muscle myoblast cells to monitor the differentiation. (Fig. 6a). The cells actively divided and displayed a very clear myoblast morphology (Fig. 6b). The changes in their morphology were monitored at 2 days (Fig.  6c), 5 days (Fig. 6d) and 7 days (Fig. 6e). The cell morphology towards that of a myotube over time, showing a myotubelike morphology on day 7. This result clearly indicates in vitro differentiation of myoblasts into myotubules. Further to evaluate the differentiation of the C2C12 cells, we quantified the expression of well-known genes involved in the myogenesis process. We quantified the expression of Myf5 and MyoG gene in mouse C2C12 cells. The expression patterns of these genes signify the differentiation of C2C12 cells (Additional file 12: Figure S8). To quantify the computationally identified lncRNA expression in mouse C2C12 cells, we measured two lncRNAs: Gm28653 and 2310043M15Rik. C2C12 cells were collected at different stages of differentiation, total RNA was isolated, and expression of both lncRNAs was quantitated by reverse transcriptase PCR (RT-PCR); Gapdh and β-actin were used as loading controls to which all samples were normalized. Three primers were used for each: Gm28653, 53-1, 53-2 and 53-3; 2310043M15Rik, Rik-1, Rik-2 and Rik-3. All primers showed unique products, as determined by a single peak of the melting curve (Additional file 13: Figure S9), and all the samples were normalized relative to the 10% BSA control. Expression of lncRNA Gm28653 was downregulated in myoblasts growing in 20% FBS, but its level gradually increased over time, with maximum expression on day 7 (Fig. 7a). For 2310043M15Rik, it was also down-regulated in 20% FBS, suggesting lower expression in myoblasts. However, its expression increased with the duration of low-serum treatment, with the highest level also on day 7. These experiments reflect the association of these lncRNAs with the myogenic differentiation of C2C12 cells (Fig. 7b).

Association of lncRNAs with genes in 3-dimensional space
Few earlier methods have been able to assign the function of lncRNAs based on the activity of nearby proteincoding genes [31] or co-expressed neighbouring coding genes [32,33], which may be helpful if the lncRNA and protein-coding gene are close to each other. In this study, many of the identified lncRNAs were found to be distally located in gene desert areas. Therefore, previous approaches may not be applicable for assigning function based on nearby genes. Nonetheless, if we consider the three-dimensional chromatin architecture, it may be possible to identify domains where genes and selected lncRNAs are close to each other in three-dimensional space. Dixon et al. [46] generated Hi-C experimental . c differentiating myotubes after serum deprivation (2% Horse Serum) for 2 days. d differentiating myotubes after serum deprivation (2% Horse Serum) for 5 days. e differentiating myotubes after serum deprivation (2% Horse Serum) for 7 days of media change and induction of differentiation data related to the chromatin structure in mammalian cells, dividing the genome into smaller blocks, modules or domains based on the distance or positional association of the genomic fragments. These domains are termed topologically associated domains (TADs). From this dataset, TADs were identified and overlapped with the location of our lncRNAs. Dixon et al. studied chromatin structures in pluripotent cells, such as mouse embryonic stem cells (mESCs) and human embryonic stem cells (hESCs), and differentiated human IMR90 fibroblasts; they observed that the overall domain structure between cell types is mostly unchanged in both pluripotent cells and their differentiated progeny. Therefore, in our analysis, we first identified TADs from mESCs and overlapped the location of the lncRNAs identified in mouse C2C12 cells, which gave us the associated genes with these lncRNAs in three-dimensional space. Gene ontology study revealed that lncRNAs and associated genes in the TADs are involved in skeletal muscle development process, muscle cell proliferation, muscle cell differentiation, chromosome organizations, histone modifications, developmental process, cellular component organizations. Some of the lncRNAs and associated genes are involved in immune response, metabolic process, cell signaling, multicellular organism development (Additional file 14).

Functional assessment of identified lncRNAs
Ling-Ling Zheng et al. used co-expression strategy to predict the putative function of lncRNAs [47]. Xiaoyue Li et al. combined Gene ontology and an approach of identification of nearby genes (100 kb) that are potentially regulated by lncRNAs [48]. We have integrated three dimensional architecture of the genome (Hi-C), Gene enrichment analysis and nearby genes coexpression strategy to identify the potential functions of the lncRNAs.
Analysis of the differentially expressed nearby proteincoding genes in myotube stage revealed that pathways such as regulation of skeletal muscle adaptation, regulation of myotube differentiation, skeletal muscle organ development, etc. were found to be prominent (Fig. 8). However, in the myoblast stage, we observed transcription and cell cycle-related pathways (Fig. 9). The complete list of associated genes involved in pathways along with p-value is provided as supporting information (Additional file 15). However, TADs were utilized to investigate the expression and function of genes associated with selected lncRNAs, revealing genes in the same TADs to be co-expressed, as previously observed by Soler-Oliva et al. [49]. Genes in TADs containing lncRNAs were mostly enriched in ontologies related to the cell cycle, glucose metabolism, lipid metabolism, cytoskeleton, actin filament, development and differentiation, and transcription.
Genes in TADs upregulated in myotubes were enriched in skeletal muscle development and function, and those upregulated in myoblasts are mainly involved in cell cycle progression and mitosis, which is a characteristic of proliferating myoblasts. Some well-studied lncRNAs involved in genome organization were found to be very highly expressed. For example, H19 is expressed in a markedly higher order. It has been reported that H19 is highly expressed during foetal development and is involved in myocyte glucose uptake, embryonic development and muscle regeneration using chromatin modifiers [37,50,51]. With our approach, we discovered a similar function. Indeed, the genes in the TAD containing H19 are involved in histone modification, glycogen biosynthesis, mitotic nuclear division, Fig. 7 Expression of muscle associated long non-coding RNA (lncRNA) determined by quantitative RT-PCR at different stage of differentiation. a Expression levels of lncRNA Gm28653 by using three different primers 53-1, 53-2 and 53-3 b Expression level lncRNA 2310043M15Rik by using three different primers Rik-1, Rik-2 and Rik-3 during myoblasts culture with 10% FBS, exponentially proliferating myoblasts in 20% FBS as well as differentiating myotubes at 2, 5 and 7 days after serum starvation muscle cell differentiation, cellular protein metabolism, and muscle contraction, and sliding. Similarly, the genes in the TADs of Neat1 and Malat1 were found to be associated with the developmental process and chromosome organization. Earlier studies have demonstrated that Neat1 and Malat1 are essential for the structural integrity of nuclear paraspeckles [52]. In mammalian cell nuclei, the ribonucleoprotein bodies known as paraspeckles play an essential role in regulating the expression of certain genes in differentiated cells through nuclear retention of RNA [53]. These findings indicate that our approach can act as a preliminary method to identify the function of lncRNAs.
In addition to Neat1, Malat1 and H19, we discovered some lncRNAs involved in chromatin organization, for example, Bach2os, Srrm4os, Gm21747, Gm17518, and Gm43672. The TAD containing Gm14635 was found to be populated with H2A histone family genes involved in biological processes and negative regulation of histone H3-K36 methylation. One TAD containing lncRNA Hist2h2bb was identified as containing replicationdependent histone genes along with immune and transcription-related genes. These lncRNAs may also be involved in chromatin reorganization.
Two of the lncRNAs detected, Gm29237 and Gm20342, participate in the cell cycle and metabolic processes in addition to chromatin remodelling (Additional file 1: Table S1). Other lncRNAs were previously demonstrated to be involved in dephosphorylation, intracellular signal transduction, cation transport, and the import of protein into the nucleus mechanism. However, using our approach, we could not assign functions to some lncRNAs, as we were unable to find any associated genes in the TADs (Additional file 1: Table S1).

Discussion
As evidence of the involvement of lncRNAs in the regulation of gene expression in a number of biological processes grows, it is essential to investigate the participation of lncRNAs in different mechanisms. In the current study, we examined expression patterns of lncRNAs and identified conserved lncRNAs involved in myogenesis. We integrated RNA-Seq, ChIP-Seq, regulatory elements, and Hi-C datasets for the detection of differentially expressed lncRNAs and their association with different histone marks, and identified conserved lncRNAs in mouse and human. We designed a criteria to determine conserved lncRNAs across species, and also shed light on the role of these lncRNAs based on the three-dimensional architecture of the genome. This approach revealed that the conserved lncRNAs may play an important role during myogenesis. We first determined lncRNAs differentially expressed in the mouse and human genomes. Regarding expression patterns, we observed a substantial increase in the expression of myotube-specific lncRNAs after differentiation and a decrease in myoblast-specific lncRNAs in differentiated myotubes. The same pattern was detected for nearby genes. Intriguingly, we observed significant upregulation of myotube-specific lncRNAs and nearby genes in myoblasts. We investigated PolII deposition along the TSSs of nearby genes and lncRNAs, and our findings corroborated the observed expression patterns for lncRNAs. Accordingly, these lncRNAs likely had already adopted features of active chromatin before maximal expression in myotubes. This observation was confirmed by the distribution of the trimethylation marker H3K36 in the gene bodies of lncRNAs and nearby protein-coding genes, which signifies the active transcription of genes by RNA PolII. We observed a notable decrease in the level of PolII along myoblast-specific lncRNAs in myotubes and a decrease in histone acetylation during myogenic differentiation. As illustrated in Fig. 3, the levels of H3K9Ac and H4K12Ac decreased drastically in myotubes. Earlier studies have reported a similar pattern [41] and confirmed that the N-terminal tail of this histone is cleaved (at residues 22-23) during ES differentiation [41,54]. This cleavage of histones upon differentiation may explain why we observed a decreased level of these modifications in myotubes. Regardless, we did not detect significant global changes in the trimethylation of H3K4, H3K36, and H3K27 during  (Figs. 4 and 5). Overall, our analysis suggested that the observed expression pattern of lncRNAs correlated with epigenetic marks.
Recent studies have reported that more than 90% of the genome giving rise to RNA [3] is non-functional [2,55], originating from transcriptional noise or the artefacts of sensitive detection methods [56]. To validate the computationally identified lncRNAs in our study, we selected two for experimental verification in C2C12 cells based on their level of expression. Among the lncRNAs detected, known lncRNAs, such as XIST and MALAT, were highly expressed. In contrast, the lncRNAs detected in our study were found to be expressed at a very low level. To determine whether these lncRNAs indeed participate in myogenesis, we cultured C2C12 cells and quantitated expression of Gm28653 and 2310043M15Rik lncRNAs by RT-PCR, finding that the levels were similar to those observed in the computational analysis. Moreover, some of the identified lncRNAs were found to be conserved between humans and mice (Additional file 1: Table S1), suggesting that they are non-random muscle-specific biologically functional lncRNAs.
Because the identified lncRNAs are conserved, we determined their possible function. The function of lncRNAs has been inferred by exploring relationships between lncRNAs and nearby protein-coding genes [31], and functions have been predicted by identifying coding genes co-expressed with lncRNAs [32,33]. In this study, we first investigated the gene ontology of nearby genes because lncRNAs are known to exhibit enhancer-like transcription-dependent activation or repression of neighbouring protein-coding genes [6,57]. The overall gene ontology of the nearby genes showed enrichment in cellular processes, metabolic processes, biological regulation and developmental processes in both human and mouse datasets.
In addition to assign a function to the lncRNAs, we investigated lncRNAs in light of chromatin threedimensional architecture data from Dixon et al. [46]. The ontologies of genes sharing the same topological domain with lncRNAs can help determine the role of the lncRNAs. One of the mechanisms by which lncRNAs control gene expression is the scaffold transcript, which provides binding sites for several RNA-binding proteins that can recruit chromatin-modifying enzymes [58,59]. For example, HOTAIR can recruit Polycomb Repressive Complex 2 to its 5′ end, followed by the generation of the H3K27me3 silencing mark, whereas its 3′ terminus can interact with the LSD1/CoREST/REST complex [59]. In the present study, we detected previously reported lncRNAs that form a scaffold for chromatin architectural changes, such as NEAT1, XIST, Malat1, and H19 ( Table  1). The ontologies of the genes co-existing in TADs are involved in chromatin and chromosome organization, covalent chromatin modification, regulation of chromatin organization, regulation of histone modification, and differentiation and development. We discovered 22 lncRNAs that colocalized with the chromatin-modifying genes in the same TAD, suggesting that these lncRNAs may have a similar mechanism of gene regulation as NEAT, XIST, and MALAT1, among others (Additional file 1: Table S1). The occurrence of lncRNAs and chromatin-modifying genes in the same TAD may suggest that the protein product of these genes is recruited by the lncRNA, which is nearby. Most of the lncRNAs we identified as being common between mice and humans are located in enhancers. It has been reported that enhancer RNAs [57,60] are transcribed from enhancers and control gene expression by affecting looping between enhancers and promoters [61,62]. Therefore, these lncRNAs in enhancers may adopt mechanisms of nucleosome positioning, chromosome looping, guide or decoy lncRNAs. Although whether the mechanisms by which lncRNAs control chromatin structure are conserved across species has yet to be determined, it is clear from our study that lncRNAs are conserved between different species, despite very low sequence similarity. This conservation may indicate that the process of chromatin structure control by the identified lncRNAs is mechanistically conserved among species. Such changes in chromatin organization directly affect transcription factor binding and RNA polymerase activity. Nonetheless, it is difficult to suggest which mechanism the remaining lncRNAs (Additional file 1: Table S1) adopt. Additionally, we discovered that some lncRNAs are engaged in cell cycle processes and metabolism. To confirm that these lncRNAs are specific to myogenesis, we crosschecked their expression in NONCODE [63] and GTEX-Portal [64] and found that 21 of 57 lncRNAs are also expressed in the heart, hippocampus, liver, lung, spleen, and thymus. The remaining lncRNAs might be more specific for muscle.

Conclusions
In summary, by integrative data analysis approach, we identified 57 differentially conserved lncRNAs in humans and mice. Studies are required to investigate the reason for the conservation of lncRNAs in humans and mice, even though their sequences are dissimilar. The lack of conservation of the lncRNA sequences may indicate that the mechanism of the lncRNAs recruiting other proteins using the motif-based binding is unlikely. However, since these lncRNAs have moderate structural conservation, chromatin structural changes may be introduced by these lncRNAs, which may regulate chromatin accessibility by the transcriptional machinery. Our analysis provides insight on the conservation of lncRNAs between human and mice and their functional annotation in myogenesis.

Methods
Myogenic transcriptomic data were analysed to identify differentially expressed lncRNAs. The filtered and processed reads were aligned to the reference genome. We then estimated the transcript abundances of the alignments using RNA-Seq Expectation-Maximization Method and selected differentially expressed genes with at least a 1-fold change and (False Discovery Rate correction <= 0.05) for downstream analysis. Expression was correlated with the histone modification study performed using Chip-seq datasets for H3k4me2, H4k20me1, H3k4me3, H3k4me1, H3k27me3, H3k36me3, H3k9ac, H3k79me2, H3k9me3, and H3k27ac for myoblasts and myotubes. Additionally, we performed real-time quantitative PCR to verify expression of muscle-associated lncRNAs at different stages of differentiation. Finally, we identified lncRNAs conserved between humans and mice and assessed their functional roles by overlapping the lncRNAs in TADs and investigating the ontologies of associated genes.
File processing and quality control of the dataset Raw SRA files were converted to FASTQ files using SRA Toolkit. Low-quality reads and adapter sequences were trimmed, and other sequencing errors (polyX detection, overlapping) were removed using the AfterQc program(v0.9.7, [65]). The program identified low quality reads if it meet at least one of following criteria: 1) too high or too low of mean base content percentages (i.e. higher than 40%, or lower than 15%); 2) too significant change of mean base content percentages (i.e., ±10% change comparing to neighbour cycle); 3) too high or too low of mean GC percentages (i.e. higher than 70%, or lower than 30%); 4) too low of mean quality (i.e. less than Q20).

Alignment of the reads
The processed reads were aligned by the RNA-Seq aligner STAR(v2.5) [66] with the Ensemble Human reference annotation (GRCh38) and Ensemble Mouse reference annotation (MM10), respectively. We used the parameter '-outFilterMismatchNmax 10 -outFilterMis-matchNoverReadLmax 0.07 -outFilterMultimapNmax 10' to accurately align the reads and identify lncRNAs [43]. The STAR aligner is suitable for aligning longer reads with high mapping accuracy and is designed to align non-contiguous sequences directly to the reference genome, which contributes to transcriptome studies by providing more complete RNA connectivity information.
We detected more lncRNAs (613) in the "relaxed" approach than in the "stringent" (204) approach. A total of 150 lncRNAs overlapped between the two methods, and these lncRNAs were found to be statistically significant (FDR < =0.05).

Quantifying transcript abundances from datasets
We used RSEM (RNA-Seq by Expectation Maximization) (v1.2.31) to quantify gene and isoform abundances from the paired-end RNA-Seq dataset. RSEM is a software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq datasets that computes maximum-likelihood abundance estimates using the expectation-maximization (EM) algorithm as its statistical model. The directed graphical model can represent the statistical model used by RSEM. After convergence, RSEM outputs ML values, as well as the expected value of the number of RNA-Seq fragments derived from each transcript, given the ML parameters. A typical run of RSEM consists of just two steps: generation of a set of reference transcript sequences and alignment to reference transcripts. The resulting alignments are used to estimate abundances and their credibility intervals [67].
The reference genome for human (version GRCh38) and mouse (version MM10) were built by using the rsem-prepare-reference script. We then used STAR to perform transcriptome-based mapping, and gene expression was calculated from STAR-generated BAM files by rsem-calculate-expression scripts.
Liu Y et al. [39] used a C2C12 cell line model to study myogenesis and regeneration, whereby the cells were allowed to differentiate from myoblast precursor cells into myotubes, followed by identification of genes that were up-regulated and down-regulated during the differentiation process. The microarray datasets were filtered based upon the presence of expressed genes in both myoblasts and myotubes. Genes that were not expressed were not considered for correlation analysis.

Identification of differentially expressed lncRNAs
The edgeR package(v3.30.0) [68], which depends upon count-based expression data for determining differential expression, in R was used for differential gene expression analysis. An overdispersed Poisson model was used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. We removed transcripts with zero expression in the samples for both the human and mouse models. Normalization of the counts was performed by using the calcNormFactors function of edgeR, which normalizes for RNA composition by finding a set of scaling factors for the library sizes that minimize the log-fold changes between samples for most genes [68]. We used the TMM method of normalization for both datasets.

Cell culture and myogenic differentiation
The myogenic mouse C2C12 cell line was maintained in growth medium, i.e., Dulbecco's modified Eagle's medium (DMEM) supplemented with 10% foetal bovine serum (FBS) and 25 mM Hepes, in a 5% CO 2 atmosphere at 37°C and 95% humidity. For rapid proliferation, the medium was changed to 20% FBS for 24 h, and 70-80% confluence was attained. The proliferating cells were then switched to a differentiation medium (DMEM containing 2% horse serum) that was subsequently changed every 48 h. Samples were taken on day 2, day 5 and day 7.

RNA isolation and cDNA synthesis
For extraction of total RNA, cells were collected during growth in 10% FBS and at the exponentially growing phase under high-serum conditions (20% FBS) as well as at 2 days, 5 days and 7 days after the medium shift with differentiation medium. RNA was isolated using RNeasy Mini Kit (Qiagen, USA) according to the manufacturer's protocol. cDNA was prepared from 1 μg of total isolated RNA using an iscript cDNA synthesis kit (BIO-RAD, USA).

Real-time quantitative PCR
Real-time quantitative RT-PCR was carried out using SYBR green (SOLIS BIODYNE, EUROPE). Samples were run in triplicate with 7.5 ng of cDNA with customdesigned lncRNA primers and a STEP ONE PLUS™ REAL TIME PCR system (Applied Biosystems). The sequences of the 9 primers are provided (Additional file 16: Table S3).
The PCR programme consisted of a holding stage of 95°C for 15 min, followed by 45 cycles of 15 s at 95°C, 20 s at 60°C and 30 s at 72°C, with 1 h of melting curve stage. GAPDH and ß-Actin were used as internal controls. Relative expression was determined using the eq. 2 −dCT , where dC T = (C target − C control ).

Annotation of differentially expressed transcripts
DE transcripts were selected based on log2 transformation Fold Change + − 2 and False Discovery Rate < = 0.05. lncRNAs and protein-coding genes were taken into consideration based on these criteria. The lncRNA length > =200 bp was further taken into consideration. The transcripts were annotated using the Ensemble Human reference annotation (GRCh38) GTF file and Ensemble Mouse reference annotation (mm10) GTF file for the human and mouse datasets, respectively. The coding potential of DE lncRNAs was analysed using Coding Potential Assessment Tool (CPAT) [69], which applies a logistic regression model that rapidly recognizes coding and noncoding transcripts from a large pool of candidates. The logistic regression model consists of four features: open reading frame size, open reading frame coverage, Fickett TESTCODE statistic and hexamer usage bias. CPAT is highly accurate and much faster than other coding potential identification tools. The software accepts input sequences in either FASTAor BED-formatted data files. We used the human coding probability (CP) cutoff of 0.364 (CP > =0.364 indicates coding sequence; CP < 0.364 indicates noncoding sequence) and the mouse coding probability (CP) cutoff of 0.44, per the recommendation by CPAT.

Identification of protein-coding genes near lncRNAs
We identified protein-coding genes (both human and mouse) that are in close proximity to the identified lncRNA, within the range of 10 kb. The chromosome locations of those genes were extracted and visualized along with the lncRNA genomic positions in Integrated Genome Browser.

Functional annotation of lncRNA
The lncRNAs were annotated using Blast2go pro software [70]. The Gene Ontology database was used for the identification of biological processes involved in developmental processes. HiCExplorer(v1.8.1) [71] software was employed for the identification of boundary positions based on available Hi-C data. Using these highresolution TAD boundaries, we identified and annotated genes present within TADs. TADs were determined from a high-resolution Hi-C matrix (20 kb) with the following parameters: binSize 20 kb, minDepth 60,000, maxDepth 12,000, step 20,000, threshold 0.05.
The quality control procedure filtered out reads with poor quality from raw FASTQ files. The processed reads were aligned with the GRCh38 human genome by the STAR aligner. The MM10 mouse genome was used for alignment of the mouse dataset. The processed BAM files generated by the STAR aligner were sorted by coordinates and reheadered using Samtools.(v1.8, [72]).

Visualization of DE genes/lncRNAs by integrating RNA-Seq and ChipSeq datasets
To explore the regulation/expression of lncRNAs and protein-coding genes involved in histone modification, we used the ngs.plot program [73]. The list of identified lncRNAs was incorporated for all histones in the human and mouse genomes. We classified the TSS region with a flanking region of 3000 bp. The program utilizes both the RNA-Seq dataset and ChipSeq dataset with arguments "ngs.plot.r -G hg38 -R tss -F chipseq,lincRNA -L 3000" and "ngs.plot.r -G hg38 -R tss -F rnaseq,lincRNA -L 3000". Transcripts with zero expression were also plotted to check the expression of lncRNAs and proteincoding genes.
Comparative study of lncRNAs identified in the mouse and human genomes We developed a method to compare the lncRNAs detected in the mouse (mm10 mouse genome) and human (GRCh38) models based on the protein-coding genes near those lncRNAs. These genes were regarded as the reference genes, and transcripts with maximum logCPM values were chosen. The sequences of identified lncRNAs were extracted, and we used a pairwise alignment strategy to detect sequence variation among mouse and human lncRNAs involved in developmental processes.

Availability of data and materials
a) Mouse transcriptomic dataset: mRNA-seq raw sequence data from Trapnell et al. [ ChIP-seq datasets used for the analysis of human samples were downloaded from the GEO database (GEO Accession ID: GSE19465, https://www.ncbi.nlm.nih.gov/ geo/query/acc.cgi?acc=GSE19465).